Networks or graphs can easily represent a diverse set of data
sources that are characterized by interacting units or actors. Social
networks, representing people who communicate with each other, are
one example. Communities or clusters of highly connected actors form
an essential feature in the structure of several empirical networks.
Spectral clustering is a popular and computationally feasible method
to discover these communities.

The stochastic blockmodel [Social Networks 5 (1983) 109-137] is
a social network model with well-defined communities; each node
is a member of one community. For a network generated from the
Stochastic Blockmodel, we bound the number of nodes "misclustered"
by spectral clustering. The asymptotic results in this paper
are the first clustering results that allow the number of clusters in
the model to grow with the number of nodes, hence the name
high-dimensional.

In order to study spectral clustering under the stochastic blockmodel,
we first show that under the more general latent space model,
the eigenvectors of the normalized graph Laplacian asymptotically
converge to the eigenvectors of a "population" normalized graph
Laplacian. Aside from the implication for spectral clustering, this
provides insight into a graph visualization technique. Our method of
studying the eigenvectors of random matrices is original.

1. Introduction. Researchers in many fields and businesses in several
industries have exploited the recent advances in information technology
to produce an explosion of data on complex systems. Several of the complex
systems have interacting units or actors that networks or graphs can
easily represent, providing a range of disciplines with a suite of potential
questions on how to produce knowledge from network data. Understand- 
ing the system of relationships between people can aid both epidemiolo- 
gists and sociologists. In biology, the predator-prey pursuits in a natural 
environment can be represented by a food web, helping researchers bet- 
ter understand an ecosystem. The chemical reactions between metabolites 
and enzymes in an organism can be portrayed in a metabolic network, pro- 
viding biochemists with a tool to study metabolism. Networks or graphs 
conveniently describe these relationships, necessitating the development of 
statistically sound methodologies for exploring, modeling and interpreting 
networks. 

Communities or clusters of highly connected actors form an essential fea- 
ture in the structure of several empirical networks. The identification of these 
clusters helps answer vital questions in a variety of fields. In the communica- 
tion network of terrorists, a cluster could be a terrorist cell; web pages that 
provide hyperlinks to each other form a community that might host discus- 
sions of a similar topic; and a community or cluster in a social network likely 
shares a similar interest. 

Searching for clusters is algorithmically difficult because it is computation- 
ally intractable to search over all possible clusterings. Even on a relatively 
small graph, one with 100 nodes, the number of different partitions exceeds 
some estimates of the number of atoms in the universe by twenty orders of 
magnitude [Champion (1998)]. For several different applications, physicists, 
computer scientists and statisticians have produced numerous algorithms 
to overcome these computational challenges. Often these algorithms aim to 
discover clusters which are approximately the "best" clusters as measured 
by some empirical objective function [see Fortunato (2010) or Fjallstrom 
(1998) for comprehensive reviews of these algorithms from the physics or 
the engineering perspective, resp.]. 
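To make the size of this search space concrete, the number of ways to partition n labelled nodes into nonempty clusters is the Bell number. The following Python sketch is an illustration we add here, not part of the original analysis; the function name `bell_number` is ours, and the comparison figure of 10^80 atoms is an assumed rough estimate. It computes the count exactly with the Bell triangle.

```python
def bell_number(n: int) -> int:
    """Number of ways to partition a set of n labelled items (Bell number)."""
    row = [1]  # first row of the Bell triangle
    for _ in range(n - 1):
        # Each new row starts with the last entry of the previous row;
        # every later entry adds the neighbour from the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


partitions_100 = bell_number(100)
# Assumed rough atom count of 10**80, raised by twenty orders of magnitude.
exceeds = partitions_100 > 10 ** 100
```

Python integers have arbitrary precision, so the count for 100 nodes is exact, and `exceeds` confirms that it is larger than 10^100.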

Clustering algorithms generally come from two sources: from fitting pro- 
cedures for various statistical models that have well-defined communities 
and, more commonly, from heuristics or insights on what network commu- 
nities should look like. This division is analogous to the difference in multi- 
variate data analysis between parametric clustering algorithms, such as an 
EM algorithm fitting a Gaussian mixture model, and nonparametric clustering
algorithms such as k-means, which are instead motivated by optimizing
an objective function. Snijders and Nowicki (1997), Nowicki and Snijders 
(2001), Handcock, Raftery and Tantrum (2007) and Airoldi et al. (2008) all 
attempt to cluster the nodes of a network by fitting various network mod- 
els that have well-defined communities. In contrast, the Girvan-Newman 
algorithm [Girvan and Newman (2002)] and spectral clustering are two al- 
gorithms in a large class of algorithms motivated by insights and heuristics 
on communities in networks.

Newman and Girvan (2004) motivate their algorithm by observing, "If
two communities are joined by only a few inter-community edges, then all 
paths through the network from vertices in one community to vertices in the 
other must pass along one of those few edges." The Girvan-Newman algo- 
rithm searches for these few edges and removes them, resulting in a graph 
with multiple connected components (connected components are clusters of 
nodes such that there are no connections between the clusters). The
Girvan-Newman algorithm then returns these connected components as the clusters.
Like the Girvan-Newman algorithm, spectral clustering is a "nonparamet- 
ric" algorithm motivated by the following insights and heuristics: spectral 
clustering is a convex relaxation of the normalized cut optimization problem 
[Shi and Malik (2000)], it can identify the connected components in a graph 
(if there are any) [Donath and Hoffman (1973), Fiedler (1973)], and it has 
an intimate connection with electrical network theory and random walks on 
graphs [Klein and Randic (1993), Meila and Shi (2001)]. 

1.1. Spectral clustering. Spectral clustering is both popular and com- 
putationally feasible [von Luxburg (2007)]. The algorithm has been redis- 
covered and reapplied in numerous different fields since the initial work of 
Donath and Hoffman (1973) and Fiedler (1973). Computer scientists have 
found many applications for variations of spectral clustering, such as load 
balancing and parallel computations [Van Driessche and Roose (1995), Hen- 
drickson and Leland (1995)], partitioning circuits for very large-scale inte- 
gration design [Hagen and Kahng (1992)] and sparse matrix partitioning 
[Pothen, Simon and Liou (1990)]. Detailed histories of spectral clustering 
can be found in Spielman and Teng (2007) and von Luxburg, Belkin and 
Bousquet (2008). 

The algorithm is defined in terms of a graph $G$, represented by a vertex
set and an edge set. The vertex set $\{v_1, \ldots, v_n\}$ contains vertices or nodes.
These are the actors in the systems discussed above. We will refer to node $v_i$
as node $i$. We will only consider unweighted and undirected edges. So, the
edge set contains a pair $(i, j)$ if there is an edge, or relationship, between
nodes $i$ and $j$. The edge set can be represented by the adjacency matrix
$W \in \{0,1\}^{n \times n}$:

$$W_{ji} = W_{ij} = \begin{cases} 1, & \text{if } (i,j) \text{ is in the edge set,} \\ 0, & \text{otherwise.} \end{cases} \tag{1.1}$$

Define $L$ and the diagonal matrix $D$, both elements of $\mathbb{R}^{n \times n}$, in the following
way:

$$D_{ii} = \sum_k W_{ik}, \qquad L = D^{-1/2} W D^{-1/2}. \tag{1.2}$$

Some readers may be more familiar with defining $L$ as $I - D^{-1/2} W D^{-1/2}$. For
spectral clustering, the difference is immaterial because both definitions have
the same eigenvectors.
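As a small numerical illustration of these definitions, the NumPy sketch below builds $L = D^{-1/2} W D^{-1/2}$ for a toy graph and checks that every eigenvector of $L$ is also an eigenvector of $I - L$, with eigenvalue $1 - \lambda$. The helper name `normalized_laplacian` and the four-node graph are our own assumptions for illustration.

```python
import numpy as np


def normalized_laplacian(W: np.ndarray) -> np.ndarray:
    """Return L = D^{-1/2} W D^{-1/2} for a symmetric adjacency matrix W."""
    degrees = W.sum(axis=1)
    if np.any(degrees == 0):
        raise ValueError("every node needs at least one edge")
    inv_sqrt = 1.0 / np.sqrt(degrees)
    return inv_sqrt[:, None] * W * inv_sqrt[None, :]


# Toy undirected graph: a triangle (0, 1, 2) with a pendant node 3.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = normalized_laplacian(W)

eigenvalues, eigenvectors = np.linalg.eigh(L)
alt = np.eye(4) - L
# If L v = lam v, then (I - L) v = (1 - lam) v: same eigenvectors.
residual = alt @ eigenvectors - eigenvectors * (1.0 - eigenvalues)
```

The residual is zero up to floating-point error, so the two conventions differ only in how the eigenvalues are labelled.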

The spectral clustering algorithm addressed in this paper is defined as 
follows: 



Spectral clustering for k many clusters

Input: Symmetric adjacency matrix $W \in \{0,1\}^{n \times n}$.

1. Find the eigenvectors $X_1, \ldots, X_k \in \mathbb{R}^n$
corresponding to the $k$ eigenvalues of $L$ that
are largest in absolute value. $L$ is symmetric,
so choose these eigenvectors to be orthogonal. Form
the matrix $X = [X_1, \ldots, X_k] \in \mathbb{R}^{n \times k}$ by putting the
eigenvectors into the columns.

2. Treating each of the $n$ rows in $X$ as a point
in $\mathbb{R}^k$, run $k$-means with $k$ clusters. This creates
$k$ nonoverlapping sets $A_1, \ldots, A_k$ whose union is $\{1, \ldots, n\}$.

Output: $A_1, \ldots, A_k$. This means that node $i$ is assigned
to cluster $g$ if the $i$th row of $X$ is assigned to $A_g$ in
step 2.



Traditionally, spectral clustering takes the eigenvectors of L correspond- 
ing to the largest k eigenvalues. The algorithm above takes the largest k 
eigenvalues by absolute value. The reason for this is explained in Section 3. 
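The two steps of the algorithm can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `spectral_clustering` is ours, all degrees are assumed positive, and step 2 uses Lloyd's heuristic with a farthest-point start, which may stop at a local optimum rather than the exact $k$-means solution.

```python
import numpy as np


def spectral_clustering(W, k, n_iter=100):
    """Cluster the nodes of a symmetric 0/1 adjacency matrix W into k groups."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                      # node degrees (assumed positive)
    L = W / np.sqrt(np.outer(d, d))        # L = D^{-1/2} W D^{-1/2}

    # Step 1: eigenvectors for the k eigenvalues largest in absolute value.
    vals, vecs = np.linalg.eigh(L)
    top = np.argsort(-np.abs(vals))[:k]
    X = vecs[:, top]                       # n x k, orthonormal columns

    # Step 2: k-means on the rows of X (Lloyd's heuristic, illustrative only).
    centers = [X[0]]
    for _ in range(1, k):                  # farthest-point initialisation
        dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq.argmin(axis=1)
        new_centers = np.array([
            X[labels == g].mean(axis=0) if np.any(labels == g) else centers[g]
            for g in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Two 4-cliques joined by a single edge between nodes 3 and 4.
W = np.zeros((8, 8))
W[:4, :4] = 1
W[4:, 4:] = 1
np.fill_diagonal(W, 0)
W[3, 4] = W[4, 3] = 1
labels = spectral_clustering(W, k=2)
```

On this toy graph the two selected eigenvalues are 1 and roughly 0.89, and the rows of $X$ separate cleanly by clique, so each clique receives its own label.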

Recently, spectral clustering has also been applied in cases where the 
graph G and its adjacency matrix W are not given, but instead inferred 
from a measure of pairwise similarity $k(\cdot, \cdot)$ between data points $X_1, \ldots, X_n$
in a metric space. The similarity matrix $K \in \mathbb{R}^{n \times n}$, whose $i,j$th element is
$K_{ij} = k(X_i, X_j)$, takes the place of the adjacency matrix $W$ in the above
definition of $L$, $D$, and the spectral clustering algorithm. For image segmentation,
Shi and Malik (2000) suggested spectral clustering on an inferred network
where the nodes are the pixels and the edges are determined by some
measure of pixel similarity. In this way, spectral clustering has many similari- 
ties with the nonlinear dimension reduction or manifold learning techniques 
such as Diffusion maps and Laplacian eigenmaps [Coifman et al. (2005), 
Belkin and Niyogi (2003)]. 
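A brief sketch of the inferred-graph setting follows. The Gaussian kernel and the bandwidth `sigma` are assumptions chosen for illustration; the text above only requires some measure of pairwise similarity $k(\cdot, \cdot)$.

```python
import numpy as np


def gaussian_similarity(points, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)); the kernel is an assumed choice."""
    points = np.asarray(points, dtype=float)
    sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


# Four data points forming two tight pairs.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
K = gaussian_similarity(points)
d = K.sum(axis=1)
L = K / np.sqrt(np.outer(d, d))   # K takes the place of W in the definition of L
```

Nearby points receive similarity close to one and distant points similarity close to zero, so $K$ behaves like a weighted adjacency matrix.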

The normalized graph Laplacian $L$ is an essential part of spectral clustering,
Diffusion maps and Laplacian eigenmaps. As such, its properties have
been well studied under the model that the data points are randomly sampled
from a probability distribution, whose support may be a manifold, and
the Laplacian is built from the inferred graph based on some measure of similarity
between data points. Belkin (2003), Lafon (2004), Bousquet, Chapelle
and Hein (2004), Hein, Audibert and von Luxburg (2005), Hein (2006), Gine 
and Koltchinskii (2006), Belkin and Niyogi (2008), von Luxburg, Belkin and 
Bousquet (2008) have all shown various forms of asymptotic convergence for 
this graph Laplacian. Although all of their results are encouraging, their re- 
sults do not apply to the random network models we study in this paper. Vu 
(2011) studies how the singular vectors of a matrix change under random 
perturbations. These results are also encouraging. However, the current pa- 
per uses a different method to study the eigenvectors of the graph Laplacian. 

1.2. Statistical estimation. Stochastic models are useful because they 
force us to think clearly about the randomness in the data in a precise and 
possibly familiar way. Many random network models have been proposed 
[Erdos and Renyi (1959), Holland and Leinhardt (1981), Holland, Laskey
and Leinhardt (1983), Frank and Strauss (1986), Watts and Strogatz (1998),
Barabasi and Albert (1999), Hoff, Raftery and Handcock (2002), Van Duijn,
Snijders and Zijlstra (2004), Goldenberg et al. (2010)]. Some of these models,
such as the Stochastic Blockmodel, have well-defined communities. The
Stochastic Blockmodel is characterized by the fact that each node belongs 
to one of multiple blocks and the probability of a relationship between two 
nodes depends only on the block memberships of the two nodes. If the prob- 
ability of an edge between two nodes in the same block is larger than the 
probability of an edge between two nodes in different blocks, then the blocks 
produce communities in the random networks generated from the model. 
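The verbal description above translates directly into a sampler. The following sketch is illustrative only: the function name `sample_sbm`, the block sizes and the probabilities in `B` are our assumptions. Each edge is drawn independently with a probability that depends only on the two block memberships.

```python
import numpy as np


def sample_sbm(block_of, B, rng):
    """Draw a symmetric 0/1 adjacency matrix with P(edge i~j) = B[block_of[i], block_of[j]]."""
    block_of = np.asarray(block_of)
    n = len(block_of)
    probs = B[np.ix_(block_of, block_of)]           # n x n edge probabilities
    upper = np.triu(rng.random((n, n)) < probs, 1)  # independent draws for i < j
    return (upper | upper.T).astype(int)            # symmetrise; diagonal stays 0


rng = np.random.default_rng(0)
block_of = np.repeat([0, 1], 50)                    # two blocks of 50 nodes
B = np.array([[0.6, 0.1],
              [0.1, 0.6]])                          # within-block > between-block
W = sample_sbm(block_of, B, rng)

within = W[:50, :50].sum() / (50 * 49)              # edge density inside block 0
between = W[:50, 50:].mean()                        # edge density across blocks
```

Because the within-block probability exceeds the between-block probability, the sampled graph shows a visibly denser pattern of edges inside each block.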

Just as statisticians have studied when least-squares regression can esti- 
mate the "true" regression model, it is natural and important for us to study 
the ability of clustering algorithms to estimate the "true" clusters in a net- 
work model. Understanding when and why a clustering algorithm correctly 
estimates the "true" communities would provide a rigorous understanding 
of the behavior of these algorithms, suggest which algorithm to choose in 
practice, and aid the corroboration of algorithmic output. 

This paper studies the performance of spectral clustering, a nonparamet- 
ric method, on a parametric task of estimating the blocks in the Stochas- 
tic Blockmodel. It connects the first strain of clustering research based on 
stochastic models to the second strain based on heuristics and insights on 
network clusters. The stochastic blockmodel allows for some first steps in 
understanding the behavior of spectral clustering and provides a benchmark 
to measure its performance. However, because this model does not really 
account for the complexities observed in several empirical networks, good 
performance on the Stochastic Blockmodel should only be considered a necessary
requirement for a good clustering algorithm.

Researchers have explored the performance of other clustering algorithms
under the Stochastic Blockmodel. Snijders and Nowicki (1997) showed the 
consistency under the two block Stochastic Blockmodel of a clustering rou- 
tine that clusters the nodes based on their degree distributions. Although 
this clustering is very easy to compute, it is not clear that the estimators
would behave well for larger graphs given the extensive literature on the 
long tail of the degree distribution [Albert and Barabasi (2002)]. Later, Con- 
don and Karp (1999) provided an algorithm and proved that it is consistent 
under the Stochastic Blockmodel, or what they call the planted $\ell$-partition
model. Their algorithm runs in linear time. However, it always estimates 
clusters that contain an equal number of nodes. More recently, Bickel and 
Chen (2009) proved that under the Stochastic Blockmodel, the maximizers 
of the Newman-Girvan modularity [Newman and Girvan (2004)] and what 
they call the likelihood modularity are asymptotically consistent estimators 
of block partitions. These modularities are objective functions that have no 
clear relationship to the Girvan-Newman algorithm. Finding the maximum 
of the modularities is NP-hard [Brandes et al. (2008)]. It is important to note
that all aforementioned clustering results involving the Stochastic Block- 
model are asymptotic in the number of nodes, with a fixed number of blocks. 

The work of Leskovec et al. (2008) shows that in a diverse set of large 
empirical networks (tens of thousands to millions of nodes), the size of the 
"best" clusters is not very large, around 100 nodes. Modern applications of 
clustering require an asymptotic regime that allows these sorts of clusters. 
Under the asymptotic regime cited in the previous paragraph, the size of 
the clusters grows linearly with the number of nodes. It would be more 
appropriate to allow the number of communities to grow with the number 
of nodes. This restricts the blocks from becoming too large, following the 
empirical observations of Leskovec et al. (2008). 

This paper provides the first asymptotic clustering results that allow the 
number of blocks in the Stochastic Blockmodel to grow with the number of 
nodes. Similar to the asymptotic results on regression techniques that allow 
the number of predictors to grow with the number of nodes, allowing the 
number of blocks to grow makes the problem one of high-dimensional learn- 
ing. Following our initial technical report, Choi, Wolfe and Airoldi (2010) 
also studied community detection under the Stochastic Blockmodel with 
a growing number of blocks. They used a likelihood-based approach, which 
is computationally difficult to implement. However, they are able to greatly 
weaken the assumptions of this paper. 

The Stochastic Blockmodel is an example of the more general latent space 
model [Hoff, Raftery and Handcock (2002)] of a random network. Under the 
latent space model, there are latent i.i.d. vectors $z_1, \ldots, z_n$, one for each
node. The probability that an edge appears between any two nodes $i$ and $j$
depends only on $z_i$ and $z_j$ and is independent of all other edges and unobserved
vectors. The results of Aldous and Hoover show that this model
characterizes the distribution of all infinite random graphs with exchange- 
able nodes [Kallenberg (2005)]. The graphs with n nodes generated from 
a latent space model can be viewed as a subgraph of an infinite graph. In 
order to study spectral clustering under the Stochastic Blockmodel, we first 
show that under the more general latent space model, as the number of 
nodes grows, the eigenvectors of L, the normalized graph Laplacian, con- 
verge to eigenvectors of the "population" normalized graph Laplacian that 
is constructed with a similarity matrix $\mathbb{E}(W \mid z_1, \ldots, z_n)$ (whose $i,j$th element
is the probability of an edge between nodes $i$ and $j$) taking the place of the
adjacency matrix $W$ in (1.2). In many ways, $\mathbb{E}(W \mid z_1, \ldots, z_n)$ is similar to the
similarity matrix $K$ discussed above, only this time the vectors $(z_1, \ldots, z_n)$
and their similarity matrix $\mathbb{E}(W \mid z_1, \ldots, z_n)$ are unobserved.

The convergence of the eigenvectors has implications beyond spectral clus- 
tering. Graph visualization is an important tool for social network analysts 
looking for structure in networks and the eigenvectors of the graph Lapla- 
cian are an essential piece of one visualization technique [Koren (2005)]. 
Exploratory graph visualization allows researchers to find structure in the 
network; this structure could be communities or something more compli- 
cated [Liotta (2004), Freeman (2000), Wasserman and Faust (1994)]. In 
terms of the latent space model, if $z_1, \ldots, z_n$ form clusters or have some
other structure in the latent space, then we might recover this structure 
from the observed graph using graph visualization. Although there are sev- 
eral visualization techniques, there is very little theoretical understanding 
of how these techniques perform under stochastic models of structured net- 
works. Because the eigenvectors of the normalized graph Laplacian converge 
to "population" eigenvectors, this provides support for a visualization tech- 
nique similar to the one proposed in Koren (2005). 

The rest of the paper is organized as follows. The next subsection of the
Introduction gives some preliminary definitions. Following the Introduction,
there are four main sections: Section 2 studies the latent space model,
Section 3 studies the Stochastic Blockmodel as a special case, Section 4 presents
some simulation results, and Section 5 investigates the plausibility of a key
assumption in five empirical social networks. Section 2 covers the eigenvectors
of $L$ under the latent space model. The main technical result is
Theorem 2.1 in Section 2, which shows that, as the number of nodes grows,
the normalized graph Laplacian multiplied by itself converges in Frobenius
norm to a symmetric version of the population graph Laplacian multiplied
by itself. The Davis-Kahan theorem then implies that the eigenvectors of
these matrices are close in an appropriate sense. Lemma 2.1 specifies how the
eigenvectors of a matrix multiplied by itself are closely related to the
eigenvectors of the original matrix. Theorem 2.2 combines Theorem 2.1 with the
Davis-Kahan theorem and Lemma 2.1 to show that the eigenvectors of the
normalized graph Laplacian converge to the population eigenvectors.
Section 3 applies these results to the high-dimensional Stochastic Blockmodel.
Lemma 3.1 shows that the population version of spectral clustering can
correctly identify the blocks in the Stochastic Blockmodel. Theorem 3.1 extends
this result to the sample version of spectral clustering. It uses Theorem 2.2
to bound the number of nodes that spectral clustering "misclusters." This
section concludes with two examples. Section 4 presents three simulations
that investigate how the asymptotic results apply to finite samples. These
simulations suggest an area for future research. The main theorems in this
paper require a strong assumption on the degree distribution. Section 5
investigates the plausibility of this assumption with five empirical online social
networks. The discussion in Section 6 concludes the paper.

1.3. Preliminaries. The latent space model proposed by Hoff, Raftery
and Handcock (2002) is a class of probabilistic models for $W$.

Definition 1. For i.i.d. random vectors $z_1, \ldots, z_n \in \mathbb{R}^k$ and random
adjacency matrix $W \in \{0,1\}^{n \times n}$, let $\mathbb{P}(W_{ij} \mid z_i, z_j)$ be the probability mass
function of $W_{ij}$ conditioned on $z_i$ and $z_j$. If a probability distribution on $W$
has the conditional independence relationships

$$\mathbb{P}(W \mid z_1, \ldots, z_n) = \prod_{i<j} \mathbb{P}(W_{ij} \mid z_i, z_j)$$

and $\mathbb{P}(W_{ii} = 0) = 1$ for all $i$, then it is called an undirected latent space
model.

This model is often simplified to assume $\mathbb{P}(W_{ij} \mid z_i, z_j) = \mathbb{P}(W_{ij} \mid \operatorname{dist}(z_i, z_j))$
where $\operatorname{dist}(\cdot, \cdot)$ is some distance function. This allows the "homophily by
attributes" interpretation that edges are more likely to appear between nodes
whose latent vectors are closer in the latent space.
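To illustrate the distance-based simplification, the sketch below draws a graph whose edge probabilities decrease with the Euclidean distance between latent vectors. The logistic link, the intercept `alpha` and the Gaussian latent positions are assumptions made only for this example; Definition 1 does not prescribe them.

```python
import numpy as np


def latent_distance_probs(Z, alpha=1.0):
    """Edge probabilities that decrease with distance (illustrative logistic link)."""
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    probs = 1.0 / (1.0 + np.exp(dists - alpha))
    np.fill_diagonal(probs, 0.0)                 # P(W_ii = 0) = 1
    return probs


rng = np.random.default_rng(1)
Z = rng.normal(size=(30, 2))                     # latent vectors z_1, ..., z_n
P = latent_distance_probs(Z)
upper = np.triu(rng.random(P.shape) < P, 1)      # independent edges given Z, i < j
W = (upper | upper.T).astype(int)                # undirected graph, no self-loops
```

Conditional on `Z`, the entries of `W` above the diagonal are independent, which is exactly the conditional independence relationship in Definition 1.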

Define $Z \in \mathbb{R}^{n \times k}$ such that its $i$th row is $z_i$ for all $i \in V$. Throughout
this paper, we assume $Z$ is fixed and unknown. Because $\mathbb{P}(W_{ij} = 1 \mid Z) =
\mathbb{E}(W_{ij} \mid Z)$, the model is then completely parametrized by the matrix

$$\mathcal{W} = \mathbb{E}(W \mid Z),$$

where $\mathcal{W}$ depends on $Z$, but this is dropped for notational convenience.

The Stochastic Blockmodel, introduced by Holland, Laskey and Leinhardt 
(1983), is a specific latent space model with well-defined communities. We 
use the following definition of the undirected Stochastic Blockmodel: 

Definition 2. The Stochastic Blockmodel is a latent space model with

$$\mathcal{W} = Z B Z^T,$$

where $Z \in \{0,1\}^{n \times k}$ has exactly one 1 in each row and at least one 1 in each
column and $B \in [0,1]^{k \times k}$ is full rank and symmetric.

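We refer to $\mathcal{W}$, the matrix which completely parametrizes the latent space
model, as the population version of $W$. Define population versions of $L$
and $D$, both in $\mathbb{R}^{n \times n}$, as

$$\mathcal{D}_{ii} = \sum_k \mathcal{W}_{ik}, \qquad \mathcal{L} = \mathcal{D}^{-1/2} \mathcal{W} \mathcal{D}^{-1/2}, \tag{1.3}$$

where $\mathcal{D}$ is a diagonal matrix, similar to before.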

The results in this paper are asymptotic in the number of nodes n. When it 
is appropriate, the matrices above are given a superscript of n to emphasize 
this dependence. Other times, this superscript is discarded for notational 
convenience. 

2. Consistency under the latent space model. We will show that the
empirical eigenvectors of $L^{(n)}$ converge in the appropriate sense to the
population eigenvectors of $\mathcal{L}^{(n)}$. If $L^{(n)}$ converged to $\mathcal{L}^{(n)}$ in Frobenius norm,
then the Davis-Kahan theorem would give the desired result. However, these
matrices do not converge. This is illustrated in an example below. Instead,
we give a novel result showing that under certain conditions $L^{(n)} L^{(n)}$
converges to $\mathcal{L}^{(n)} \mathcal{L}^{(n)}$ in Frobenius norm. This implies that the eigenvectors
of $L^{(n)} L^{(n)}$ converge to the eigenvectors of $\mathcal{L}^{(n)} \mathcal{L}^{(n)}$. The following lemma
shows that these eigenvectors can be chosen to imply the eigenvectors of $L^{(n)}$
converge to the eigenvectors of $\mathcal{L}^{(n)}$.

Lemma 2.1. When $M \in \mathbb{R}^{n \times n}$ is a symmetric real matrix,

(1) $\lambda^2$ is an eigenvalue of $MM$ if and only if $\lambda$ or $-\lambda$ is an eigenvalue
of $M$.

(2) If $Mv = \lambda v$, then $MMv = \lambda^2 v$.

(3) Conversely, if $MMv = \lambda^2 v$, then $v$ can be written as a linear combination
of eigenvectors of $M$ whose eigenvalues are $\lambda$ or $-\lambda$.

A proof of Lemma 2.1 can be found in Appendix A. 
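A quick numerical check can make the lemma concrete. The sketch below is an illustration we add, with an arbitrary random symmetric matrix. It verifies that the spectrum of $MM$ is the squared spectrum of $M$, and it shows why part (3) speaks of linear combinations: when $\lambda$ and $-\lambda$ are both eigenvalues of $M$, an eigenvector of $MM$ need not be an eigenvector of $M$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
M = (A + A.T) / 2                       # random symmetric matrix

eig_M = np.linalg.eigvalsh(M)
eig_MM = np.linalg.eigvalsh(M @ M)      # returned in ascending order

# Parts (1) and (2): the spectrum of MM is the squared spectrum of M.
squares_match = np.allclose(np.sort(eig_M ** 2), eig_MM)

# Part (3): for M2 = diag(1, -1) we have M2 M2 = I, so v = (1, 1)/sqrt(2)
# is an eigenvector of M2 M2 but only a combination of eigenvectors of M2.
M2 = np.diag([1.0, -1.0])
v = np.array([1.0, 1.0]) / np.sqrt(2)
is_eigvec_of_MM = np.allclose(M2 @ M2 @ v, v)
is_eigvec_of_M = np.allclose(M2 @ v, v) or np.allclose(M2 @ v, -v)
```

Here `is_eigvec_of_MM` is true while `is_eigvec_of_M` is false, which is the situation part (3) is written to handle.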

Example. To see how squaring a matrix helps convergence, let the matrix
$W \in \mathbb{R}^{n \times n}$ have i.i.d. Bernoulli(1/2) entries. Because the diagonal
elements in $D$ grow like $n$, the matrix $W/n$ behaves similarly to $D^{-1/2} W D^{-1/2}$.
Without squaring the matrix, the Frobenius distance from the matrix to its
expectation is

$$\|W/n - \mathbb{E}(W)/n\|_F = \frac{1}{n} \sqrt{\sum_{i,j} (W_{ij} - \mathbb{E}(W_{ij}))^2} = 1/2.$$

Notice that, for $i \ne j$,

$$[WW]_{ij} = \sum_k W_{ik} W_{kj} \sim \text{Binomial}(n, 1/4)$$

and $[WW]_{ii} \sim \text{Binomial}(n, 1/2)$. So, for any $i, j$, $[WW]_{ij} - \mathbb{E}[WW]_{ij} =
o(n^{1/2} \log n)$. Thus, the Frobenius distance from the squared matrix to
its expectation is

$$\|WW/n^2 - \mathbb{E}(WW)/n^2\|_F = \frac{1}{n^2} \sqrt{\sum_{i,j} ([WW]_{ij} - \mathbb{E}[WW]_{ij})^2} = o\left(\frac{\log n}{n^{1/2}}\right).$$

When the elements of $W$ are i.i.d. Bernoulli(1/2), $(W/n)^2$ converges in
Frobenius norm and $W/n$ does not. The next theorem addresses the
convergence of $L^{(n)} L^{(n)}$.
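The example can be reproduced numerically. In the sketch below we assume a symmetric $W$ with independent Bernoulli(1/2) entries on and above the diagonal; this is our choice, made so that the diagonal of $WW$ is Binomial($n$, 1/2) as stated, and it gives $\mathbb{E}[WW]_{ij} = n/4$ off the diagonal and $n/2$ on it. The function name `frobenius_gaps` is hypothetical.

```python
import numpy as np


def frobenius_gaps(n, rng):
    """Return (||W/n - E W/n||_F, ||WW/n^2 - E(WW)/n^2||_F) for one random draw."""
    upper = np.triu(rng.random((n, n)) < 0.5)        # includes the diagonal
    W = (upper | upper.T).astype(float)              # symmetric Bernoulli(1/2) matrix
    EW = np.full((n, n), 0.5)
    EWW = np.full((n, n), n / 4.0)                   # E[WW]_ij for i != j
    np.fill_diagonal(EWW, n / 2.0)                   # E[WW]_ii
    unsquared = np.linalg.norm(W / n - EW / n, "fro")
    squared = np.linalg.norm(W @ W / n ** 2 - EWW / n ** 2, "fro")
    return unsquared, squared


rng = np.random.default_rng(0)
unsq_100, sq_100 = frobenius_gaps(100, rng)
unsq_400, sq_400 = frobenius_gaps(400, rng)
```

The unsquared distance stays at 1/2 for every $n$, because each entry of $W$ misses its mean by exactly 1/2, while the squared distance shrinks as $n$ grows.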

Define

$$\tau_n = \min_{i = 1, \ldots, n} \mathcal{D}_{ii}^{(n)} / n. \tag{2.1}$$

Recall that $\mathcal{D}_{ii}^{(n)}$ is the expected degree for node $i$. So, $\tau_n$ is the minimum
expected degree, divided by the maximum possible degree. It measures how
quickly the number of edges accumulates.

Theorem 2.1. Define the sequence of random matrices $W^{(n)} \in \{0,1\}^{n \times n}$
to be from a sequence of latent space models with population matrices
$\mathcal{W}^{(n)} \in [0,1]^{n \times n}$. With $W^{(n)}$, define the observed graph Laplacian $L^{(n)}$ as in (1.2).
Let $\mathcal{L}^{(n)}$ be the population version of $L^{(n)}$ as defined in (1.3). Define $\tau_n$ as
in (2.1).

If there exists $N > 0$, such that $\tau_n^2 \log n > 2$ for all $n > N$, then

$$\|L^{(n)} L^{(n)} - \mathcal{L}^{(n)} \mathcal{L}^{(n)}\|_F = o\left(\frac{\log n}{\tau_n^2 n^{1/2}}\right) \quad \text{a.s.}$$

Appendix A contains a nonasymptotic bound on $\|L^{(n)} L^{(n)} - \mathcal{L}^{(n)} \mathcal{L}^{(n)}\|_F$
as well as the proof of Theorem 2.1. The main condition in this theorem is
the lower bound on $\tau_n$. This sufficient condition is used to produce Gaussian
tail bounds for each of the $D_{ii}$ and other similar quantities.
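To get a feeling for how demanding the lower bound on $\tau_n$ is, the following sketch evaluates (2.1) and the condition $\tau_n^2 \log n > 2$ for a hypothetical two-block population matrix. The block sizes, the matrix `B` and the helper names are assumptions made for illustration.

```python
import numpy as np


def tau(W_pop):
    """tau_n of (2.1): minimum expected degree divided by n."""
    n = W_pop.shape[0]
    return W_pop.sum(axis=1).min() / n


def degree_condition_holds(tau_n, n):
    """Check the sufficient condition tau_n^2 * log(n) > 2."""
    return bool(tau_n ** 2 * np.log(n) > 2)


# Hypothetical two-block population matrix with n = 100 nodes.
block_of = np.repeat([0, 1], 50)
B = np.array([[0.6, 0.1],
              [0.1, 0.6]])
W_pop = B[np.ix_(block_of, block_of)]        # same as Z B Z^T
tau_n = tau(W_pop)                           # (50 * 0.6 + 50 * 0.1) / 100 = 0.35
holds = degree_condition_holds(tau_n, 100)   # 0.35**2 * log(100) is about 0.56
```

With $\tau_n = 0.35$ held fixed, the condition requires $\log n > 2 / 0.35^2 \approx 16.3$, that is, $n$ on the order of ten million, which indicates how strong the assumption is for graphs of moderate size.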

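For any symmetric matrix $M$, define $\lambda(M)$ to be the eigenvalues of $M$
and for any interval $S \subset \mathbb{R}$, define

$$\lambda_S(M) = \{\lambda(M) \cap S\}.$$

Further, define $\bar\lambda_1^{(n)} \ge \cdots \ge \bar\lambda_n^{(n)}$ to be the elements of $\lambda(\mathcal{L}^{(n)} \mathcal{L}^{(n)})$ and
$\lambda_1^{(n)} \ge \cdots \ge \lambda_n^{(n)}$ to be the elements of $\lambda(L^{(n)} L^{(n)})$. The eigenvalues of $L^{(n)} L^{(n)}$
converge in the following sense,

$$\max_i |\lambda_i^{(n)} - \bar\lambda_i^{(n)}| \le \|L^{(n)} L^{(n)} - \mathcal{L}^{(n)} \mathcal{L}^{(n)}\|_F = o\left(\frac{\log n}{\tau_n^2 n^{1/2}}\right) \quad \text{a.s.} \tag{2.2}$$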



This follows from Theorem 2.1, Weyl's inequality [Bhatia (1987)], and the 
fact that the Frobenius norm is an upper bound of the spectral norm. 
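The chain of inequalities behind this step can be checked numerically on arbitrary symmetric matrices. The sketch below uses randomly generated matrices that we introduce only for illustration. It compares the largest gap between ordered eigenvalues with the spectral and Frobenius norms of the perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
A = (A + A.T) / 2                      # symmetric matrix
E = 0.1 * rng.normal(size=(8, 8))
E = (E + E.T) / 2                      # symmetric perturbation
B = A + E

# eigvalsh returns eigenvalues in ascending order, so entries are matched by rank.
gap = np.max(np.abs(np.linalg.eigvalsh(A) - np.linalg.eigvalsh(B)))
spectral = np.linalg.norm(E, 2)        # largest singular value of the perturbation
frobenius = np.linalg.norm(E, "fro")
```

Weyl's inequality gives `gap <= spectral`, and `spectral <= frobenius` holds for every matrix, so a Frobenius bound on the perturbation controls every ordered eigenvalue.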

This shows that under certain conditions on $\tau_n$, the eigenvalues of $L^{(n)} L^{(n)}$
converge to the eigenvalues of $\mathcal{L}^{(n)} \mathcal{L}^{(n)}$. In order to study spectral clustering,
it is now necessary to show that the eigenvectors also converge. The
Davis-Kahan theorem provides a bound for this.

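Proposition 2.1 (Davis-Kahan). Let $S \subset \mathbb{R}$ be an interval. Denote $\mathcal{X}$
as an orthonormal matrix whose column space is equal to the eigenspace
of $\mathcal{L}\mathcal{L}$ corresponding to the eigenvalues in $\lambda_S(\mathcal{L}\mathcal{L})$ [more formally, the
column space of $\mathcal{X}$ is the image of the spectral projection of $\mathcal{L}\mathcal{L}$ induced by
$\lambda_S(\mathcal{L}\mathcal{L})$]. Denote by $X$ the analogous quantity for $LL$. Define the distance
between $S$ and the spectrum of $\mathcal{L}\mathcal{L}$ outside of $S$ as

$$\delta = \min\{|\ell - s|;\ \ell \text{ eigenvalue of } \mathcal{L}\mathcal{L},\ \ell \notin S,\ s \in S\}.$$

If $\mathcal{X}$ and $X$ are of the same dimension, then there is an orthonormal matrix
$O$, that depends on $\mathcal{X}$ and $X$, such that

$$\frac{1}{2} \|X - \mathcal{X} O\|_F^2 \le \frac{\|LL - \mathcal{L}\mathcal{L}\|_F^2}{\delta^2}.$$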

The original Davis-Kahan theorem bounds the "canonical angle," also
known as the "principal angle," between the column spaces of $\mathcal{X}$ and $X$.
Appendix B explains how this can be converted into the bound stated above. To
understand why the orthonormal matrix $O$ is included, imagine the situation
that $L = \mathcal{L}$. In this case $X$ is not necessarily equal to $\mathcal{X}$. At a minimum,
the columns of $X$ could be a permuted version of those in $\mathcal{X}$. If there are
any eigenvalues with multiplicity greater than one, these problems could
be slightly more involved. The matrix $O$ removes these inconveniences and
related inconveniences.
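The role of $O$ can be illustrated with the orthogonal Procrustes solution, one standard way to compute an orthonormal matrix that aligns two bases. This is our illustration and not necessarily the construction used in Appendix B; the function name `best_rotation` is hypothetical. In the sketch the two matrices span the same column space but differ by a column permutation and a sign flip.

```python
import numpy as np


def best_rotation(X_pop, X_emp):
    """Orthogonal O minimising ||X_emp - X_pop @ O||_F (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(X_pop.T @ X_emp)
    return U @ Vt


rng = np.random.default_rng(0)
X_pop, _ = np.linalg.qr(rng.normal(size=(10, 3)))   # orthonormal columns
# Same column space, but columns permuted and one sign flipped.
X_emp = X_pop[:, [2, 0, 1]] * np.array([1.0, -1.0, 1.0])

O = best_rotation(X_pop, X_emp)
aligned_error = np.linalg.norm(X_emp - X_pop @ O, "fro")
raw_error = np.linalg.norm(X_emp - X_pop, "fro")
```

Without the rotation the two matrices look far apart even though they describe the same eigenspace; after multiplying by `O` the distance is zero up to rounding.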

The bound in the Davis-Kahan theorem is sensitive to the value $\delta$. This
reflects that when there are eigenvalues of $\mathcal{L}\mathcal{L}$ close to $S$, but not inside
of $S$, then a small perturbation can move these eigenvalues inside of $S$ and
drastically alter the eigenvectors. The next theorem combines the previous
results to show that the eigenvectors of $L^{(n)}$ converge to the eigenvectors
of $\mathcal{L}^{(n)}$. Because it is asymptotic in the number of nodes, it is important
to allow $S$ and $\delta$ to depend on $n$. For a sequence of open intervals $S_n \subset \mathbb{R}$,
define

$$\delta_n = \inf\{|\ell - s|;\ \ell \in \lambda(\mathcal{L}^{(n)} \mathcal{L}^{(n)}),\ \ell \notin S_n,\ s \in S_n\}, \tag{2.3}$$

$$\delta_n' = \inf\{|\ell - s|;\ \ell \in \lambda_{S_n}(\mathcal{L}^{(n)} \mathcal{L}^{(n)}),\ s \notin S_n\}, \tag{2.4}$$

$$S_n' = \{\ell;\ \ell^2 \in S_n\}. \tag{2.5}$$

The quantity $\delta_n'$ is added to measure how well $S_n$ insulates the eigenvalues of
interest. If $\delta_n'$ is too small, then some important empirical eigenvalues might
fall outside of $S_n$. By restricting the rate at which $\delta_n$ and $\delta_n'$ converge to
zero, the next theorem ensures the dimensions of $X$ and $\mathcal{X}$ agree for a large
enough $n$. This is required in order to use the Davis-Kahan theorem.
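For a finite spectrum and an open interval $S_n = (a, b)$, the two gaps in (2.3) and (2.4) reduce to simple minima. The sketch below is illustrative; the function name `eigengaps` and the example spectrum are assumptions, and the function expects at least one eigenvalue inside and one outside the interval.

```python
import numpy as np


def eigengaps(eigs_sq, a, b):
    """delta and delta' of (2.3)-(2.4) for the open interval S = (a, b)."""
    eigs_sq = np.asarray(eigs_sq, dtype=float)
    inside = (eigs_sq > a) & (eigs_sq < b)
    outside_vals = eigs_sq[~inside]
    inside_vals = eigs_sq[inside]
    # Distance from each outside eigenvalue to the interval.
    delta = np.min(np.maximum(a - outside_vals, outside_vals - b))
    # Distance from each inside eigenvalue to the complement of the interval.
    delta_prime = np.min(np.minimum(inside_vals - a, b - inside_vals))
    return delta, delta_prime


# Hypothetical eigenvalues of the squared population Laplacian.
delta, delta_prime = eigengaps([1.0, 0.64, 0.09, 0.01, 0.0], a=0.5, b=1.1)
```

Here `delta` is 0.5 - 0.09 = 0.41 and `delta_prime` is 1.1 - 1.0 = 0.1. For this interval, the set in (2.5) consists of all $\ell$ with $|\ell|$ between $\sqrt{0.5}$ and $\sqrt{1.1}$.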

Theorem 2.2. Define $W^{(n)} \in \{0,1\}^{n \times n}$ to be a sequence of growing random
adjacency matrices from the latent space model with population matrices
$\mathcal{W}^{(n)}$. With $W^{(n)}$, define the observed graph Laplacian $L^{(n)}$ as in (1.2).
Let $\mathcal{L}^{(n)}$ be the population version of $L^{(n)}$ as defined in (1.3). Define $\tau_n$ as
in (2.1). With a sequence of open intervals $S_n \subset \mathbb{R}$, define $\delta_n$, $\delta_n'$ and $S_n'$ as
in (2.3), (2.4) and (2.5).

Let $k_n = |\lambda_{S_n'}(L^{(n)})|$, the size of the set $\lambda_{S_n'}(L^{(n)})$. Define the matrix
$X_n \in \mathbb{R}^{n \times k_n}$ such that its orthonormal columns are the eigenvectors of the symmetric
matrix $L^{(n)}$ corresponding to all the eigenvalues contained in $\lambda_{S_n'}(L^{(n)})$.
For $\mathcal{K}_n = |\lambda_{S_n'}(\mathcal{L}^{(n)})|$, define $\mathcal{X}_n \in \mathbb{R}^{n \times \mathcal{K}_n}$ to be the analogous matrix for
the symmetric matrix $\mathcal{L}^{(n)}$ with eigenvalues in $\lambda_{S_n'}(\mathcal{L}^{(n)})$.

Assume that $n^{-1/2} (\log n)^2 = O(\min\{\delta_n, \delta_n'\})$. Also assume that there
exists a positive integer $N$ such that for all $n > N$, it follows that $\tau_n^2 > 2/\log n$.

Eventually, $k_n = \mathcal{K}_n$. Afterward, for some sequence of orthonormal rotations
$O_n$,

$$\|X_n - \mathcal{X}_n O_n\|_F = o\left(\frac{\log n}{\delta_n \tau_n^2 n^{1/2}}\right) \quad \text{a.s.}$$

A proof of Theorem 2.2 is in Appendix C. There are two key assumptions
in Theorem 2.2:

(1) $n^{-1/2} (\log n)^2 = O(\min\{\delta_n, \delta_n'\})$,

(2) $\tau_n^2 > 2/\log n$.

The first assumption ensures that the "eigengap," the gap between the
eigenvalues of interest and the rest of the eigenvalues, does not converge to zero
too quickly. The theorem is most interesting when $S$ includes only the leading
eigenvalues. This is because the eigenvectors with the largest eigenvalues
have the potential to reveal clusters or other structures in the network. When
these leading eigenvalues are well separated from the smaller eigenvalues, the
eigengap is large. The second assumption ensures that the expected degree
of each node grows sufficiently fast. If $\tau_n$ is constant, then the expected
degree of each node grows linearly. The assumption $\tau_n^2 > 2/\log n$ is almost as
restrictive.

The usefulness of Theorem 2.2 depends on how well the eigenvectors
of $\mathcal{L}^{(n)}$ represent the characteristics of interest in the network. For example,
under the Stochastic Blockmodel with $B$ full rank, if $S_n$ is chosen so that $S_n'$
contains all nonzero eigenvalues of $\mathcal{L}^{(n)}$, then the block structure can be
determined from the columns of $\mathcal{X}_n$. It can be shown that nodes $i$ and $j$ are in
the same block if and only if the $i$th row of $\mathcal{X}_n$ equals the $j$th row. The next
section examines how spectral clustering exploits this structure, using $X_n$
to estimate the block structure in the Stochastic Blockmodel.
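The row-equality property can be observed directly on a small population matrix. In the sketch below the membership vector and the matrix `B` are hypothetical values chosen for illustration. It forms the population Laplacian of a three-block model and inspects the eigenvectors that belong to nonzero eigenvalues.

```python
import numpy as np

block_of = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
Z = np.eye(3)[block_of]                    # n x k membership matrix
B = np.array([[0.8, 0.2, 0.1],
              [0.2, 0.7, 0.3],
              [0.1, 0.3, 0.9]])            # full rank and symmetric
W_pop = Z @ B @ Z.T                        # population version of W
d = W_pop.sum(axis=1)                      # expected degrees
L_pop = W_pop / np.sqrt(np.outer(d, d))    # population Laplacian as in (1.3)

vals, vecs = np.linalg.eigh(L_pop)
nonzero = np.abs(vals) > 1e-10
X_pop = vecs[:, nonzero]                   # eigenvectors with nonzero eigenvalues
```

The population Laplacian has exactly three nonzero eigenvalues, one per block, and the rows of `X_pop` coincide for nodes in the same block while differing across blocks.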

3. The Stochastic Blockmodel. The work of Leskovec et al. (2008) shows 
that the sizes of the best clusters are not very large in a diverse set of 
empirical networks, suggesting that the appropriate asymptotic framework 
should allow for the number of communities to grow with the number of 
nodes. This section shows that, under suitable conditions, spectral clustering 
can correctly partition most of the nodes in the Stochastic Blockmodel, even 
when the number of blocks grows with the number of nodes. 

The Stochastic Blockmodel, introduced by Holland, Laskey and Leinhardt 
(1983), is a specific latent space model. Because it has well-defined commu- 
nities in the model, community detection can be framed as a problem of 
statistical estimation. The important assumption of this model is that of 
stochastic equivalence within the blocks; if two nodes i and j are in the 
same block, rows i and j of $\mathscr{W}$ are equal.

Recall the definition of the undirected Stochastic Blockmodel,

$$\mathscr{W} = ZBZ^T,$$

where $Z \in \{0,1\}^{n \times k}$ is fixed and has exactly one 1 in each row and at least
one 1 in each column, and $B \in [0,1]^{k \times k}$ is full rank and symmetric. In this
definition there are k blocks and n nodes. If the (i,g)th element of Z equals
one ($Z_{ig} = 1$), then node i is in block g. As before, $z_i$ for $i = 1, \dots, n$
denotes the ith row of Z. The matrix B contains the probabilities
of edges within and between blocks. Some researchers have allowed Z
to be random; we have decided to focus instead on the randomness of W
conditioned on Z. The aim of a clustering algorithm is to estimate Z (up to
a permutation of the columns) from W.

This section bounds the number of "misclustered" nodes. Because a per- 
mutation of the columns of Z is unidentifiable in the Stochastic Blockmodel, 
it is not obvious what a "misclustered" node is. Before giving our definition
of "misclustered," some preliminaries are needed to explain why it is a
reasonable definition. The next paragraphs examine the behavior of spectral
clustering applied to the population graph Laplacian $\mathscr{L}$. Then, this is
compared to spectral clustering applied to the observed graph Laplacian L. This
motivates our definition of "misclustered."

Recall that the spectral clustering algorithm applied to L,

(1) finds the eigenvectors, $X \in R^{n \times k}$,

(2) treats each row of the matrix X as a point in $R^k$, and

(3) runs k-means on these points.

k-means is defined by an objective function. Applied to the points
$\{x_1, \dots, x_n\} \subset R^k$ it is [Steinhaus (1956)]

(3.1)   $$\min_{\{m_1, \dots, m_k\} \subset R^k} \sum_i \min_g \|x_i - m_g\|_2^2.$$

The analysis in this paper addresses the true optimum of (3.1). (In practice,
this optimization problem can suffer from local optima.) The vectors
$m_1^*, \dots, m_k^*$ that optimize the k-means function are referred to as the
centroids of the k clusters.
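
As a small illustration of (3.1), the following NumPy sketch evaluates the objective for a given set of candidate centroids. The function name `kmeans_objective` and the toy points are assumptions made for illustration only; the code evaluates the objective and does not search for its optimum.

```python
import numpy as np

def kmeans_objective(points, centroids):
    """Value of the k-means objective (3.1) for fixed candidate centroids.

    points:    (n, d) array, one row per point x_i
    centroids: (m, d) array, one row per candidate centroid m_g
    """
    # Squared Euclidean distance from every point to every centroid: (n, m).
    diffs = points[:, None, :] - centroids[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Each point contributes the squared distance to its nearest centroid.
    return float(np.sum(np.min(sq_dists, axis=1)))

# Toy data (hypothetical): two tight pairs of points.
points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
good = np.array([[0.0, 0.5], [4.0, 0.5]])   # one centroid per pair
bad = np.array([[2.0, 0.0], [2.0, 1.0]])    # centroids between the pairs

obj_good = kmeans_objective(points, good)
obj_bad = kmeans_objective(points, bad)
```

Centroids placed at the pair means give a much smaller objective than centroids placed between the pairs, which is the sense in which (3.1) rewards tight clusters.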

This next lemma shows that spectral clustering applied to the population
Laplacian, $\mathscr{L}$, can discover the block structure in the matrix Z. This lemma
is essential to defining "misclustered."

Lemma 3.1. Under the Stochastic Blockmodel with k blocks,

$$\mathscr{W} = ZBZ^T \in R^{n \times n} \quad \text{for } B \in R^{k \times k} \text{ and } Z \in \{0,1\}^{n \times k},$$

define $\mathscr{L}$ as in (1.3). There exists a matrix $\mu \in R^{k \times k}$ such that the columns
of $Z\mu$ are the eigenvectors of $\mathscr{L}$ corresponding to the nonzero eigenvalues.
Further,

(3.2)   $$z_i\mu = z_j\mu \iff z_i = z_j,$$

where $z_i$ is the ith row of Z.

A proof of Lemma 3.1 is in Appendix D. 
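
Lemma 3.1 can be checked numerically on a small example. The sketch below assumes that the population Laplacian in (1.3) has the form $\mathscr{D}^{-1/2}\mathscr{W}\mathscr{D}^{-1/2}$, with $\mathscr{D}$ the diagonal matrix of expected degrees; the block sizes, the matrix `B` and the helper name `population_laplacian` are made up for illustration.

```python
import numpy as np

def population_laplacian(Z, B):
    """Assumed form of (1.3): D^{-1/2} W D^{-1/2} built from W = Z B Z^T."""
    W_pop = Z @ B @ Z.T                      # expected adjacency matrix
    d = W_pop.sum(axis=1)                    # expected degrees
    return W_pop / np.sqrt(np.outer(d, d))

# Hypothetical model: three blocks of sizes 2, 3 and 4.
labels = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
k, n = 3, len(labels)
Z = np.eye(k)[labels]                        # one 1 per row
B = np.array([[0.6, 0.1, 0.2],
              [0.1, 0.5, 0.1],
              [0.2, 0.1, 0.7]])              # symmetric, full rank

L_pop = population_laplacian(Z, B)
eigvals, eigvecs = np.linalg.eigh(L_pop)
top = np.argsort(-np.abs(eigvals))[:k]       # the k nonzero eigenvalues
X_pop = eigvecs[:, top]

# Rows i and j of the eigenvector matrix coincide exactly when i and j
# share a block, which is the content of (3.2).
rows_equal = np.array([[np.allclose(X_pop[i], X_pop[j], atol=1e-8)
                        for j in range(n)] for i in range(n)])
same_block = labels[:, None] == labels[None, :]
```

In this example `rows_equal` and `same_block` agree entry by entry, so the population eigenvectors contain exactly k distinct rows.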

Equivalence statement (3.2) implies that under the k block Stochastic
Blockmodel there are k unique rows in the eigenvectors $Z\mu$ of $\mathscr{L}$. This has
important consequences for the spectral clustering algorithm. The spectral
clustering algorithm applied to $\mathscr{L}$ will run k-means on the rows of $Z\mu$.
Because there are only k unique points, each of these points will be a centroid
of one of the resulting clusters. Further, if $z_i\mu = z_j\mu$, then i and j will be
assigned to the same cluster. With equivalence statement (3.2), this implies
that spectral clustering applied to the matrix $\mathscr{L}$ can perfectly identify the
block memberships in Z. Obviously, $\mathscr{L}$ is not observed. In practice, spectral
clustering is applied to L. Let $X \in R^{n \times k}$ be a matrix whose orthonormal
columns are the eigenvectors corresponding to the largest k eigenvalues (in
absolute value) of L.

Definition 3. Spectral clustering applies the k-means algorithm to the
rows of X, that is, each row is a point in $R^k$. Each row is assigned to one
cluster and each of these clusters has a centroid. Define $c_1, \dots, c_n \in R^k$ such
that $c_i$ is the centroid corresponding to the ith row of X.

Recall that $z_i\mu$ is the centroid corresponding to node i from the population
analysis. If the observed centroid $c_i$ is closer to the population centroid $z_i\mu$
than it is to any other population centroid $z_j\mu$ for $z_j \ne z_i$, then it appears
that node i is correctly clustered. This definition is appealing because it
removes some of the cluster identifiability problem. However, the eigenvectors
add one additional source of unidentifiability. Let $O \in R^{k \times k}$ be the
orthonormal rotation from Theorem 2.2. Consider node i to be correctly clustered
if $c_i$ is closer to $z_i\mu O$ than it is to any other (rotated) population
centroid $z_j\mu O$ for $z_j \ne z_i$. The slight complication with O stems from the fact
that the vectors $c_1, \dots, c_n$ are constructed from the eigenvectors in X and
Theorem 2.2 shows these eigenvectors converge to the rotated population
eigenvectors: $\mathscr{X}O = Z\mu O$.

Define P to be the population of the largest block in Z,

(3.3)   $$P = \max_{j = 1, \dots, k} (Z^T Z)_{jj}.$$

The following provides a sufficient condition for a node to be correctly clus- 
tered. 

Lemma 3.2. For the orthonormal matrix $O \in R^{k \times k}$ from Theorem 2.2,

(3.4)   $$\|c_i - z_i\mu O\|_2 < 1/\sqrt{2P}$$

(3.5)   $$\Longrightarrow \quad \|c_i - z_i\mu O\|_2 < \|c_i - z_j\mu O\|_2 \quad \text{for any } z_j \ne z_i.$$

A proof of Lemma 3.2 is in Appendix D. 

Line (3.5) is the previously motivated definition of correctly clustered. 
Thus, Lemma 3.2 shows that the inequality in line (3.4) is a sufficient con- 
dition for node i to be correctly clustered. 

Definition 4. Define the set of misclustered nodes as the nodes that
do not satisfy the sufficient condition (3.4),

(3.6)   $$\mathscr{M} = \{i : \|c_i - z_i\mu O\|_2 \ge 1/\sqrt{2P}\}.$$
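
Definition (3.6) translates directly into a distance check. The sketch below is only an illustration: the function name `misclustered_nodes` is hypothetical, the rotation `O` must be supplied by the caller (it is taken as given from Theorem 2.2), and the numbers are toy values rather than eigenvectors of a real model.

```python
import numpy as np

def misclustered_nodes(C, Z_mu, O, P):
    """Indices i with ||c_i - z_i mu O||_2 >= 1 / sqrt(2P), as in (3.6).

    C:    (n, k) array whose ith row is the observed centroid c_i
    Z_mu: (n, k) array whose ith row is the population centroid z_i mu
    O:    (k, k) orthonormal rotation (assumed known here)
    P:    size of the largest block, as in (3.3)
    """
    dist = np.linalg.norm(C - Z_mu @ O, axis=1)
    return np.flatnonzero(dist >= 1.0 / np.sqrt(2.0 * P))

# Toy inputs (hypothetical): three nodes, threshold 1/sqrt(2*2) = 0.5.
Z_mu = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
O = np.eye(2)
C = np.array([[1.0, 0.1], [0.4, 0.0], [0.0, 1.0]])
M_set = misclustered_nodes(C, Z_mu, O, P=2)
```

Only the second node is flagged here, because its centroid lies at distance 0.6 from its rotated population centroid.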

The next theorem bounds the size of the set $\mathscr{M}$.

Theorem 3.1. Suppose $W \in R^{n \times n}$ is an adjacency matrix from the
Stochastic Blockmodel with $k_n$ blocks. Define the population graph Laplacian,
$\mathscr{L}$, as in (1.3). Define $|\lambda_1| \ge |\lambda_2| \ge \dots \ge |\lambda_{k_n}| > 0$ as the absolute
values of the $k_n$ nonzero eigenvalues of $\mathscr{L}$. Define $\mathscr{M}$, the set of misclustered
nodes, as in (3.6). Define $\tau_n$ as in (2.1) and assume there exists N such that
for all n > N, $\tau_n^2 > 2/\log n$. Define $P_n$ as in (3.3). If $n^{-1/2}(\log n)^2 = O(\lambda_{k_n}^2)$,
then the number of misclustered nodes is bounded

$$|\mathscr{M}| = o\left(\frac{P_n (\log n)^2}{\lambda_{k_n}^4 \tau_n^4 n}\right) \quad \text{a.s.}$$

A proof of Theorem 3.1 is in Appendix D. The two main assumptions of
Theorem 3.1 are

(1) $n^{-1/2}(\log n)^2 = O(\lambda_{k_n}^2)$,

(2) eventually $\tau_n^2 \log n > 2$.

They imply the conditions needed to apply Theorem 2.2. The first assumption
requires that the smallest nonzero eigenvalue of $\mathscr{L}$ is not too small.
Combined with an appropriate choice of $S_n$, this assumption implies the
eigengap assumption in Theorem 2.2. The second assumption is exactly the
same as the second assumption in Theorem 2.2. Section 4 investigates the
sensitivity of spectral clustering to these two assumptions. Section 5 examines
the plausibility of assumption (2) on five empirical online social networks.

In all previous spectral clustering algorithms, it has been suggested that 
the eigenvectors corresponding to the largest eigenvalues reveal the clus- 
ters of interest. The above theorem suggests that before finding the largest 
eigenvalues, you should first order them by absolute value. This allows for 
large and negative eigenvalues. In fact, eigenvectors of L corresponding to 
eigenvalues close to negative one (all eigenvalues of L are in [-1, 1]) discover
"heterophilic" structure in the network that can be useful for clustering. 
For example, in the network of dating relationships in a high school, two 
people of opposite sex are more likely to date than people of the same sex. 
This pattern creates the two male and female "clusters" that have many 
fewer edges within than between clusters. In this case, L would likely have 
an eigenvalue close to negative one. The corresponding eigenvector would 
reveal these "heterophilic" clusters. 

Example. To examine the ability of spectral clustering to discover het- 
erophilic clusters, imagine a Stochastic Blockmodel with two blocks and two 
nodes in each block. Define

$$B = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

In this case, there are no connections within blocks and every member is
connected to the two members of the opposite block. There is no variability
in the matrix W. The rows and columns of L can be reordered so that
it is a block matrix. The two block matrices down the diagonal are 2x2 
matrices of zeros and all the elements in the off diagonal blocks are equal 
to 1/2. There are two nonzero eigenvalues of L. Any constant vector is an 
eigenvector of L with eigenvalue equal to one. The remaining eigenvalue 
belongs to any eigenvector that is a constant multiple of (1,1,-1,-1). In 
this case, with perfect "heterophilic" structure, the eigenvector that is useful 
for finding the clusters has eigenvalue negative one. 
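
The example can be reproduced in a few lines. The sketch assumes that B has zeros on the diagonal and ones off the diagonal, which matches the description above, and that L is the normalized matrix $D^{-1/2}WD^{-1/2}$; the variable names are illustrative.

```python
import numpy as np

# Two blocks with two nodes each; edges only between blocks.
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
B = np.array([[0.0, 1.0], [1.0, 0.0]])
W = Z @ B @ Z.T                        # deterministic adjacency matrix

d = W.sum(axis=1)                      # every node has degree 2
L = W / np.sqrt(np.outer(d, d))        # off diagonal blocks equal 1/2

eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
v = eigvecs[:, 0]                      # eigenvector for the eigenvalue -1
```

The eigenvalues are -1, 0, 0 and 1. The entries of `v` share one sign within a block and the opposite sign across blocks, so the eigenvector with eigenvalue negative one separates the two heterophilic blocks.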

Heuristically, the reason spectral clustering can discover these heterophilic 
blocks is related to our method of proof. The (i,j)th element of WW is the
number of neighbors that nodes i and j have in common. In both heterophilic
and homophilic cases, if nodes i and j are in the same block, then they should
have several neighbors in common. Thus, $[WW]_{ij}$ is large. Similarly, $[LL]_{ij}$
is large. This shows that the number of common neighbors is a measure
of similarity that is robust to the choice of hetero- or homophilic clusters. 
Because spectral clustering uses a related measure of similarity, it is able to 
detect both types of clusters. 
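
The common-neighbor interpretation of WW can be seen on two toy graphs, one heterophilic and one homophilic. Both graphs and the helper name `common_neighbors` are hypothetical illustrations, not data from the paper.

```python
import numpy as np

def common_neighbors(W):
    """Entry (i, j) of W @ W counts the neighbors shared by nodes i and j."""
    return W @ W

# Heterophilic: blocks {0, 1} and {2, 3}, edges only between blocks.
het = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [1, 1, 0, 0],
                [1, 1, 0, 0]])

# Homophilic: two disjoint triangles {0, 1, 2} and {3, 4, 5}.
hom = np.zeros((6, 6), dtype=int)
for block in ([0, 1, 2], [3, 4, 5]):
    for i in block:
        for j in block:
            if i != j:
                hom[i, j] = 1

cn_het = common_neighbors(het)
cn_hom = common_neighbors(hom)
```

In both graphs, pairs of nodes in the same block share neighbors while pairs in different blocks share none, even though the heterophilic blocks contain no internal edges.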

In order to clarify the bound on $|\mathscr{M}|$ in Theorem 3.1, a simple example
illustrates how $\lambda_{k_n}$, $\tau_n$ and $P_n$ might depend on n.

Definition 5. The four parameter Stochastic Blockmodel is parametrized
by k, s, r and p. There are k blocks each containing s nodes. The probability
of a connection between two nodes in two separate blocks is $r \in [0,1]$
and the probability of a connection between two nodes in the same block is
$p + r \in [0,1]$.

Example. In the four parameter Stochastic Blockmodel, there are n = ks
nodes. Notice that $P_n = s$ and $\tau_n > r$. Appendix D shows that the smallest
nonzero eigenvalue of the population graph Laplacian is equal to

$$\lambda_k = \frac{1}{k(r/p) + 1}.$$

Using Theorem 3.1, if $p \ne 0$ and $k = O(n^{1/4}/\log n)$, then

(3.7)   $$|\mathscr{M}| = o(k^3 (\log n)^2) \quad \text{a.s.}$$

Further, the proportion of nodes that are misclustered converges to zero,

$$\frac{|\mathscr{M}|}{n} = o(n^{-1/4}) \quad \text{a.s.}$$

This example is particularly surprising after noticing that if $k = n^{\alpha}$ for
$\alpha \in (0, 1/4)$, then the vast majority of edges connect nodes in different
blocks. To see this, look at a sequence of models such that $k = n^{\alpha}$. Note that
$s = n^{1-\alpha}$. So, for each node, the expected number of connections to nodes in
the same block is $(p + r)n^{1-\alpha}$ and the expected number of connections to
nodes in different blocks is $r(n - n^{1-\alpha})$. Hence

$$\frac{\text{Expected number of in block connections}}{\text{Expected number of out of block connections}} = \frac{(p + r)n^{1-\alpha}}{r(n - n^{1-\alpha})} = O(n^{-\alpha}).$$

These are not the tight communities that many imagine when considering 
networks. Instead, a dwindling fraction of each node's edges actually connect 
to nodes in the same block. The vast majority of edges connect nodes in 
different blocks. 
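
Both claims in this example can be checked numerically. The sketch below assumes the population Laplacian has the form $\mathscr{D}^{-1/2}\mathscr{W}\mathscr{D}^{-1/2}$ and uses arbitrary illustrative parameter values; the function names are hypothetical.

```python
import numpy as np

def four_parameter_population(k, s, r, p):
    """Expected adjacency matrix of the four parameter model (Definition 5)."""
    labels = np.repeat(np.arange(k), s)
    Z = np.eye(k)[labels]
    B = p * np.eye(k) + r * np.ones((k, k))   # p + r within, r between blocks
    return Z @ B @ Z.T

def in_out_ratio(n, alpha, r, p):
    """Expected in block over out of block connections when k = n**alpha."""
    s = n ** (1 - alpha)
    return (p + r) * s / (r * (n - s))

k, s, r, p = 4, 5, 0.1, 0.2                   # illustrative values
W_pop = four_parameter_population(k, s, r, p)
d = W_pop.sum(axis=1)
L_pop = W_pop / np.sqrt(np.outer(d, d))

# Sort absolute eigenvalues in decreasing order; the kth is the smallest
# nonzero one because the population Laplacian has rank k.
abs_eigs = np.sort(np.abs(np.linalg.eigvalsh(L_pop)))[::-1]
smallest_nonzero = abs_eigs[k - 1]
predicted = 1.0 / (k * (r / p) + 1.0)

ratio_small = in_out_ratio(1e4, 0.2, r, p)
ratio_large = in_out_ratio(1e6, 0.2, r, p)
```

With these values `smallest_nonzero` agrees with `predicted` (both equal 1/3), and the in block share of edges shrinks as n grows from 10^4 to 10^6.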

A more refined result would allow r to decay with n. However, when r 
decays, so does the minimum expected degree, and the tail bounds used in
proving Theorem 2.1 require the minimum expected degree to grow nearly
as fast as n. Allowing r to decay with n is an area for future research.

4. Simulations. Three simulations in this section illustrate how the asymp- 
totic bounds in this paper can be a guide for finite sample results. These 
simulations emphasize the importance of the eigengap in Theorem 2.2 and 
suggest that the asymptotic bounds in this paper hold for relatively small 
networks. The simulations also suggest two shortcomings of the theoreti- 
cal results in this paper. First, Simulation 1 shows that spectral clustering 
appears to be consistent in some situations. Unfortunately, the theoretical 
results in Theorem 3.1 are not sharp enough to prove consistency. Second, 
Simulation 3 suggests that spectral clustering is still consistent even when 
the minimum expected node degree grows more slowly than the number of 
nodes. However, the theorems above require a stronger condition, that the 
minimum expected degree grows almost linearly with the number of nodes. 

All data are simulated from the four parameter Stochastic Blockmodel 
(Definition 5). In the first simulation, the number of nodes in each block s 
grows while the number of blocks k and the probabilities p and r remain
fixed. In the second simulation, k grows while s, p and r remain fixed. In the
final simulation, s and k remain fixed while r and p shrink such that p/r
remains fixed. Because k(r/p) is fixed, the eigengap is also fixed.

There is one important detail needed to recreate our simulation results below.
The spectral clustering result stated in Theorem 3.1 requires the true optimum
of the k-means objective function. This is very difficult to ensure. However,
only one step in the proof of Theorem 3.1 requires the true optimum. The
optimum of k-means satisfies inequality (D.4) in Appendix D. In simulations,
this inequality can be verified directly. For the simulations below,
the k-means algorithm is run several times, all with random initializations,
until the bound (D.4) is met.
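
A minimal simulation in this spirit might look as follows. This is a sketch under stated assumptions, not the code used for the results below: it samples graphs without self-loops, uses a plain Lloyd iteration for k-means, and keeps the best of a fixed number of random restarts instead of checking inequality (D.4), which is not reproduced here. All function names and parameter values are illustrative.

```python
import numpy as np

def sample_sbm(labels, B, rng):
    """Sample a symmetric 0/1 adjacency matrix without self-loops."""
    probs = B[labels][:, labels]                   # n x n edge probabilities
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    return (upper | upper.T).astype(float)

def normalized_laplacian(W):
    d = W.sum(axis=1)
    d[d == 0] = 1.0                                # guard against isolated nodes
    return W / np.sqrt(np.outer(d, d))

def lloyd(points, k, rng, iters=100):
    """One run of Lloyd's algorithm from a random initialization."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        sq = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = sq.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        updated = np.array([
            points[assign == g].mean(axis=0) if np.any(assign == g) else centroids[g]
            for g in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    sq = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq.argmin(axis=1), float(sq.min(axis=1).sum())

def spectral_clustering(W, k, rng, restarts=20):
    vals, vecs = np.linalg.eigh(normalized_laplacian(W))
    X = vecs[:, np.argsort(-np.abs(vals))[:k]]     # top k by absolute value
    runs = [lloyd(X, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])       # lowest objective (3.1)

rng = np.random.default_rng(0)
k, s, p, r = 3, 40, 0.5, 0.1                       # illustrative values
labels = np.repeat(np.arange(k), s)
B = p * np.eye(k) + r * np.ones((k, k))
W = sample_sbm(labels, B, rng)
assignments, objective = spectral_clustering(W, k, rng)
```

Keeping the best of several restarts lowers the risk of reporting a poor local optimum, but unlike the (D.4) check it gives no guarantee about the quality of the returned solution.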

Simulation 1: In this simulation, k = 5, p = 0.2, r = 0.1 and the number
of members in each group grows from 8 to 215. This implies that n grows
from 40 to 1075. Equation (3.7) suggests that the number of misclustered
nodes should grow more slowly than $(\log n)^2$. In fact, Figure 1 shows that
once there are enough nodes, the number of misclustered nodes converges to
zero. The top plot displays the number of misclustered nodes plotted against
log n, which initially increases. Then, it falls precipitously.

Fig. 1. The top panel in this figure displays the number of misclustered nodes plotted
against log n. The bottom panel displays both $\log\|LL - \mathscr{L}\mathscr{L}\|_F$ and $\log\|X - \mathscr{X}O\|_F$
plotted against log n. Each dot represents one simulation of the model. In addition, the
bottom panel has a line with slope -1/2. This figure illustrates two things. First, after a
certain threshold (around log n = 4.7), the eigenvectors of the graph Laplacian begin to
converge and after this point, the number of misclustered nodes converges to zero. Second,
the lines representing $\log\|LL - \mathscr{L}\mathscr{L}\|_F$ and $\log\|X - \mathscr{X}O\|_F$ are approximately parallel
to the line with slope -1/2. This suggests that they converge around rate $O(n^{-1/2})$,
similar to the theoretical results in Lemma 2.1 and Theorem 2.2.

The lower plot in Figure 1 displays why the number of misclustered
nodes falls so precipitously. It plots $\log\|LL - \mathscr{L}\mathscr{L}\|_F$ (dashed bold line)
and $\log\|X - \mathscr{X}O\|_F$ (solid bold line) on the vertical axis against log n on
the horizontal axis. Also displayed in this plot is a line with slope -1/2 (solid
thin line). Note that the solid bold line starts to run parallel to the solid thin
line once log n > 4.5. After this point, the eigenvectors converge, and spectral
clustering begins to correctly cluster all of the nodes. The proof of the
convergence of the eigenvectors for Theorem 2.2 requires an eigengap condition,

$$n^{-1/2}\log n = O(\min\{\delta_n, \delta_n'\}).$$

Similar to the example in the previous section, $S_n$ can be chosen in this four
parameter model so that $\min\{\delta_n, \delta_n'\} = (k(r/p) + 1)^{-2}$. In this simulation,
the eigenvectors begin to converge, and the number of misclustered nodes
drops just after the bound $n^{-1/2} < (k(r/p) + 1)^{-2}$ is met. Ignoring the log n
factor, this suggests that the eigengap condition in Theorem 2.2 is necessary.

This simulation demonstrates the importance of the relationship between 
the sample size and the eigengap. In this simulation, there needs to be 
roughly 50 nodes in each block to separate the informative eigenvectors 
from the uninformative eigenvectors. Once there are enough nodes, the em- 
pirical eigenvectors are close to the population eigenvectors. Then, spectral 
clustering can estimate the block structure. 

The lower plot in Figure 1 also suggests that, ignoring log n factors, the
rates of convergence given in Theorem 2.1 and Theorem 2.2 are sharp.
Both LL and the eigenvectors X converge at a rate $O(n^{-1/2})$. This is because
the dashed bold line and the solid bold line (for large enough n)
are approximately parallel to the solid thin line.

Simulation 2: In this simulation from the four parameter Stochastic Block- 
model, each block contains 35 nodes, p = 0.3 and r = 0.05. The number of 
blocks k grows from 2 to 110. Equation (3.7) suggests that under this asymptotic
regime, the number of misclustered nodes should grow more slowly than
$k^3(\log n)^2$. Figure 2 shows how this theoretical quantity can be an appropriate
guide.

[Figure 2. Panel title: "With more blocks, clustering is more difficult"
(p = 0.3, r = 0.05, 35 members of each block). Horizontal axis: log(k).]

Fig. 2. This figure plots the number of misclustered nodes (thicker line) against log k.
Each dot represents one simulation from the model. Additionally, there is a line with slope 3
(thinner line). Equation (3.7) says that the number of misclustered nodes is $o(k^3 \log k)$.
Because the thicker line has a slope that is similar to the thinner line, this result appears
to be a good approximation.

Figure 2 plots the log of the number of misclustered nodes (bold line)
against log k. For comparison, a line with slope 3 is also plotted (thin line).
Because the bold line has a slope approximately equal to that of the thin line,
the number of misclustered nodes is approximately proportional to $k^3$.

This simulation demonstrates that as the number of blocks grows, the
number of misclustered nodes also grows. Although $\|LL - \mathscr{L}\mathscr{L}\|_F$ converges
under this asymptotic regime, $\|X - \mathscr{X}O\|_F$ does not because the eigengap
shrinks more quickly than the number of nodes can tolerate.

Simulation 3: The theorems in this paper assume that the smallest ex- 
pected degree grows close to linearly with the number of nodes in the graph. 
This simulation examines the sensitivity of spectral clustering to this
assumption. Recall that the smallest expected degree is equal to $n\tau$.

In this simulation, there are three different designs all from the four
parameter Stochastic Blockmodel. Each design has three blocks (k = 3). One
design contains 50 nodes in each block, another contains 150 in each block,
and the last design contains 250 nodes in each block. To investigate how
sensitive spectral clustering is to the value of $\tau = p/k + r$, the probabilities p
and r must change. However, to isolate the effect of $\tau$ from the effect of
the eigengap $(k(r/p) + 1)^{-2}$, it is necessary to keep the ratio p/r constant.
Fixing p/r = 2 ensures that the eigengap is fixed at 4/25.

The results for Simulation 3 are displayed in Figure 3. The value $\tau$ is on
the horizontal axis, and the number of misclustered nodes is on the vertical
axis. There are three lines. The thickest line represents the design with 50
nodes in each block. The line of medium thickness represents the design with
150 nodes in each block. The thinnest line represents the design with 250
nodes in each block. All three lines increase as $\tau$ approaches zero (reading the
figure from right to left). The thickest line starts to increase at $\tau = 0.20$. The
thinnest line starts to increase at $\tau = 0.07$. The line with medium thickness
increases in between these two lines.

Because the thinner lines start to increase at a smaller value of $\tau$, this
suggests that as n increases, $\tau$ can decrease. As such, spectral clustering
should be able to correctly cluster the nodes in a Stochastic Blockmodel
graph when the minimum expected degree does not grow linearly with the
number of nodes in the graph.

Lemma 2.1, Theorem 2.2, and Theorem 3.1 all require the minimum ex- 
pected degree to grow at the same rate as n (ignoring logn terms). Al- 
though the strict assumption is inappropriate for large networks, this simu- 
lation demonstrates (1) that spectral clustering works for smaller networks 
and (2) that the asymptotic theory presented earlier in the paper can be 
a guide to smaller networks. In these networks, it is not as unreasonable
that each node would be connected to a significant proportion of the other
nodes.

Fig. 3. This figure displays the number of misclustered nodes from three different models
plotted against $\tau = \min_i E(D_{ii})/n$. The first model has 50 nodes in each block (thickest
line), the second model has 150 nodes in each block (line with medium thickness), the third
model has 250 nodes in each block (thinnest line). Each dot represents the average of ten
simulations from the model. In each of these models, p and r decrease such that p/r is
always equal to 2. This ensures that $\tau$ goes to zero, while the eigengap remains constant.
Each of the three models is sensitive to small values of $\tau$. However, the larger models can
tolerate a smaller value of $\tau$. This suggests that as n increases, $\tau$ should be allowed to
decrease. The theorems in this paper do not allow for that possibility.



5. Empirical edge density. In several networks, there is a natural or 
canonical notion of what an edge represents. In an online social network, 
friendship is the canonical notion of an edge. With this canonical notion, 
the edges in most empirical networks are not dense enough to suggest the 
asymptotic framework assumed in Lemma 2.1, Theorems 2.2 and 3.1. 

Although it is an area of future research to weaken the strong assumption 
on the expected node degrees, there are potentially other notions of similarity
that can replace the canonical notion. Define the canonical edge set $E_c$
to contain (i,j) if nodes i and j are connected with a canonical edge. One
possible extension of $E_c$ is

(5.1)   $$E_{ff} = \{(i,j) : (i,k) \in E_c \text{ and } (k,j) \in E_c \text{ for some } k\}.$$

In words, $(i,j) \in E_{ff}$ if i and j are friends of friends.
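
Given a 0/1 adjacency matrix for the canonical edges, the friends-of-friends edge set (5.1) can be computed from the matrix square. In the sketch below, the function name `friends_of_friends` is hypothetical, and dropping the diagonal (so that a node is not counted as its own friend of a friend) is an assumption that (5.1) does not state explicitly.

```python
import numpy as np

def friends_of_friends(W_c):
    """0/1 adjacency matrix of E_ff built from the canonical adjacency W_c."""
    paths2 = W_c @ W_c                  # (i, j) counts common friends k
    W_ff = (paths2 > 0).astype(int)     # an edge whenever some k exists
    np.fill_diagonal(W_ff, 0)           # assumption: exclude pairs (i, i)
    return W_ff

# Toy canonical network (hypothetical): the path 0 - 1 - 2 - 3.
W_c = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
W_ff = friends_of_friends(W_c)
```

On the path graph, nodes 0 and 2 become connected through node 1, and nodes 1 and 3 through node 2. Canonical friends such as 0 and 1 appear in $E_{ff}$ only if they also share a friend.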

Table 1 investigates the edge density of five empirical networks defined
using both $E_c$ and $E_{ff}$. These five networks come from the Facebook networks
of five universities: California Institute of Technology (Caltech), Princeton
University, Georgetown University, University of Oklahoma, University of
North Carolina at Chapel Hill (UNC). Traud et al. (2008) made these data
sets publicly available and investigated the community structure in them.

Table 1
This table describes five basic characteristics of the Facebook social network within five
universities. In the table below, $\overline{\deg}^c$ is the average node degree using the canonical edges
of friendship and $\overline{\deg}^{ff}$ is the average node degree using the "friends-of-friends" edges as
defined with (5.1). The statistics $T_c$ and $T_{ff}$ [defined in (5.4) and (5.5)] are equal to the
percent of nodes that are connected to more than 10% of the nodes in the graph. The
table below shows that the network is much more connected when using edges defined by
"friends-of-friends." All numbers are rounded to the nearest integer

School                  Caltech   Princeton   Georgetown   Oklahoma      UNC
n                           769       6,596        9,414     17,425   18,163
$\overline{\deg}^c$          43          89           90        102       84
$\overline{\deg}^{ff}$      487       2,663        3,320      5,420    5,242
$T_c$                        16
$T_{ff}$                     94          88           87         81       79

Let $W^c$ denote the adjacency matrix constructed from $E_c$. Let $W^{ff}$ denote
the adjacency matrix constructed from $E_{ff}$. Let $\deg^c \in R^n$ and $\deg^{ff} \in R^n$
denote the degree sequences of the nodes with respect to the two edge sets $E_c$
and $E_{ff}$. That is, $\deg_i^c = \sum_j W_{ij}^c$, and similarly for $\deg^{ff}$. Define

(5.2)   $$\overline{\deg}^c = \frac{1}{n} \sum_i \deg_i^c,$$

(5.3)   $$\overline{\deg}^{ff} = \frac{1}{n} \sum_i \deg_i^{ff},$$

(5.4)   $$T_c = \frac{1}{n} \sum_i 1\{\deg_i^c > n/10\},$$

(5.5)   $$T_{ff} = \frac{1}{n} \sum_i 1\{\deg_i^{ff} > n/10\}.$$

The first two quantities are equal to the average node degrees. The last two
quantities are the proportion of nodes connected to more than 10% of the nodes
in the network (reported as percentages in Table 1).
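
The statistics (5.2) to (5.5) are simple functions of the degree sequence. The following sketch computes them for an arbitrary adjacency matrix; the function name `degree_summary` and the star graph are illustrative assumptions and have nothing to do with the Facebook data.

```python
import numpy as np

def degree_summary(W):
    """Average degree, as in (5.2)-(5.3), and the share of nodes connected
    to more than 10% of the network, as in (5.4)-(5.5)."""
    n = W.shape[0]
    deg = W.sum(axis=1)                      # degree sequence
    avg_degree = deg.mean()
    share_well_connected = np.mean(deg > n / 10)
    return avg_degree, share_well_connected

# Toy graph (hypothetical): a star with one hub and 19 leaves.
n = 20
star = np.zeros((n, n), dtype=int)
star[0, 1:] = 1
star[1:, 0] = 1
avg_deg, T = degree_summary(star)
```

For the star graph the average degree is 1.9, yet only the hub (1 node in 20) is connected to more than 10% of the network, which illustrates why the two kinds of statistics are reported separately.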

Table 1 demonstrates how the edge density increases after replacing $E_c$
with $E_{ff}$. The statistics $T_c$ and $T_{ff}$, in the last two lines of the table, can be
used to gauge the suitability of the assumption $\tau_n^2 > 2/\log n$ in the theorems
above. Recall that $\tau$ is the minimum expected degree divided by n. So, for
example, if $T_{ff} = 1$, then it is reasonable to expect that $\tau > 1/10$. Because
there are some nodes that have a very small degree, $T_c$ and $T_{ff}$ look at the
proportion of nodes that are well connected.

It is an empirical observation that most nodes in large graphs have small
degrees. This suggests that the assumption $\tau_n^2 > 2/\log n$ in Lemma 2.1,
Theorem 2.2 and Theorem 3.1 is not satisfied in practice. Table 1 demonstrates
that by using an alternative notion of adjacency or connectedness, the network
can become much more connected.

6. Discussion. The goal of this paper is to bring statistical rigor to the 
study of community detection by assessing how well spectral clustering can 
estimate the clusters in the Stochastic Blockmodel. The Stochastic Block- 
model is easily amenable to the analysis of clustering algorithms because of 
its simplicity and well-defined communities. The fact that spectral cluster- 
ing performs well on the Stochastic Blockmodel is encouraging. However, 
because the Stochastic Blockmodel fails to represent fundamental features 
that most empirical networks display, this result should only be considered 
a first step. 

This paper has two main results. The first main result, Theorem 2.2,
proves that under the latent space model, the eigenvectors of the empirical 
normalized graph Laplacian converge to the eigenvectors of the population 
normalized graph Laplacian — so long as (1) the minimum expected degree 
grows fast enough and (2) the eigengap that separates the leading eigenvalues 
from the smaller eigenvalues does not shrink too quickly. This theorem has 
consequences in addition to those related to spectral clustering. 

Visualization is an important tool for social network analysts [Liotta
(2004), Freeman (2000), Wasserman and Faust (1994)]. However, there is 
little statistical understanding of these techniques under stochastic models. 
Two visualization techniques, factor analysis and multidimensional scaling, 
have variations that utilize the eigenvectors of the graph Laplacian. Similar 
approaches were suggested for social networks as far back as the 1950s [Bock 
and Husain (1952), Breiger, Boorman and Arabie (1975)]. Koren (2005)
suggests visualizing the graph using the eigenvectors of the unnormalized 
graph Laplacian. The analogous method for the normalized graph Laplacian 
would use the ith row of X as the coordinates for the ith node. Theorem 2.2 
shows that, under the latent space model, this visualization is not much 
different than visualizing the graph by instead replacing X with $\mathscr{X}$. If there
is structure in the latent space of a latent space model (e.g., the $z_1, \dots, z_n$
form clusters) and this structure is represented in the eigenvectors of the 
population normalized graph Laplacian, then plotting the eigenvectors will 
potentially reveal this structure. 

The Stochastic Blockmodel is a specific latent space model that satisfies 
these conditions. It has well-defined clusters or blocks and Lemma 3.1 shows 
that, under weak assumptions, the eigenvectors of the population normalized 
graph Laplacian perfectly identify the block structure. Theorem 2.2 suggests 
that you could discover this clustering structure by using the visualization 
technique proposed by Koren (2005). The second main result, Theorem 3.1,
goes further to suggest just how many nodes you might miscluster by running
k-means on those points (this is spectral clustering). Theorem 3.1 proves
that if (1) the minimum expected degree grows fast enough and (2) the
smallest nonzero eigenvalue of the population normalized graph Laplacian
shrinks slowly enough, then the proportion of nodes that are misclustered
by spectral clustering vanishes asymptotically.

The asymptotic framework applied in Theorem 3.1 allows the number 
of blocks to grow with the number of nodes; this is the first such high- 
dimensional clustering result. Allowing the number of clusters to grow is 
reasonable because as Leskovec et al. (2008) noted, large networks do not 
necessarily have large communities. In fact, in a wide range of empirical 
networks, the tightest communities have a roughly constant size. Allowing 
the number of blocks to grow with the number of nodes ensures the clusters 
do not become too large. 

There are two main limitations of our results that are highlighted in the 
simulations in Section 4. First, Theorem 3.1 does not show that spectral clus- 
tering is consistent under the Stochastic Blockmodel; it only gives a bound 
on the number of misclassified nodes. Improving this bound is an area for 
future research. The second shortcoming is that Lemma 2.1, Theorems 2.2 
and 3.1 all require the minimum expected degree to grow at the same rate 
as n (ignoring logn terms). In large empirical networks, the canonical edges 
are not dense enough to suggest this type of asymptotic framework. Sec- 
tion 5 suggests alternative definitions of edges that might increase the edge 
density. That said, studying spectral clustering under more realistic degree 
distributions is an area for future research. 

APPENDIX A: PROOF OF THEOREM 2.1 

First, a proof of Lemma 2.1. 

Proof of Lemma 2.1. By eigendecomposition, $M = \sum_{i=1}^n \lambda_i u_i u_i^T$, where
$u_1, \dots, u_n$ are orthonormal eigenvectors of M. So,

$$MM = \Big(\sum_{i=1}^n \lambda_i u_i u_i^T\Big)\Big(\sum_{i=1}^n \lambda_i u_i u_i^T\Big) = \sum_{i=1}^n \lambda_i^2 u_i u_i^T.$$

Right multiplying by any $u_i$ yields $MMu_i = \lambda_i^2 u_i$. This proves one direction
of part one in the lemma: if $\lambda$ is an eigenvalue of M, then $\lambda^2$ is an eigenvalue
of MM. It also proves part two of the lemma: all eigenvectors of M are also
eigenvectors of MM.

To see that if $\lambda^2$ is an eigenvalue of MM, then $\lambda$ or $-\lambda$ is an eigenvalue
of M, notice that both M and MM have exactly n eigenvalues (counting
multiplicities) because both matrices are real and symmetric. So, the previous
paragraph specifies n eigenvalues of MM by squaring the eigenvalues
of M. Because MM has exactly n eigenvalues, there are no other eigenvalues.

The rest of the proof is devoted to part three of the lemma. Let $MMv = \lambda^2 v$.
By eigenvalue decomposition, $M = \sum_i \lambda_i u_i u_i^T$ and because $u_1, \dots, u_n$
are orthonormal (M is real and symmetric) there exist $\alpha_1, \dots, \alpha_n$ such that

\begin{align*}
\lambda^2 \sum_i \alpha_i u_i = \lambda^2 v = MMv
  &= M\Big(\sum_i \lambda_i u_i u_i^T v\Big) = M\Big(\sum_i \lambda_i \alpha_i u_i\Big) \\
  &= \sum_i \lambda_i \alpha_i M u_i = \sum_i \lambda_i^2 \alpha_i u_i.
\end{align*}

By the orthogonality of the $u_i$'s, it follows that $\lambda^2 \alpha_i = \lambda_i^2 \alpha_i$ for all i. So, if
$\lambda_i^2 \ne \lambda^2$, then $\alpha_i = 0$. □
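
The first two parts of Lemma 2.1 are easy to confirm numerically for a real symmetric matrix. The random matrix below is an arbitrary illustration and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
M = (A + A.T) / 2                                   # real and symmetric

eigvals_M, eigvecs_M = np.linalg.eigh(M)
eigvals_MM = np.linalg.eigvalsh(M @ M)              # ascending order

# Part one: the eigenvalues of MM are the squared eigenvalues of M.
squares_match = np.allclose(np.sort(eigvals_M ** 2), eigvals_MM)

# Part two: each eigenvector u_i of M satisfies MM u_i = lambda_i^2 u_i.
# Column i of eigvecs_M is scaled by lambda_i^2 through broadcasting.
residual = M @ M @ eigvecs_M - eigvecs_M * eigvals_M ** 2
vectors_shared = np.allclose(residual, 0.0)
```

Both checks hold up to floating point error, as the lemma predicts.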

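For $i = 1, \dots, n$, define $c_i = \mathscr{D}_{ii}/n$ and $\tau = \min_{i = 1, \dots, n} c_i$.

Lemma A.1. If $n^{1/2}/\log n > 2$,

$$P\left(\|LL - \mathscr{L}\mathscr{L}\|_F \ge \frac{32\sqrt{2}\,\log n}{\tau^2 n^{1/2}}\right) \le 4n^{2 - 2\log n}.$$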

The main complication of the proof of Lemma A.1 is controlling the
dependencies between the elements of LL. We do this with an intermediate
step that uses the matrix

$$\tilde{L}\tilde{L} = \mathscr{D}^{-1/2} W \mathscr{D}^{-1} W \mathscr{D}^{-1/2}$$

and two sets $\Gamma$ and $\Lambda$. $\Gamma$ constrains the matrix D, while $\Lambda$ constrains the
matrix $W\mathscr{D}^{-1}W$. These sets will be defined in the proof. To ease the notation,
define

$$P_{\Gamma\Lambda}(B) = P(B \cap \Gamma \cap \Lambda),$$

where B is some event.

Proof of Lemma A.1. This proof shows that under the sets $\Gamma$ and $\Lambda$
the probability of the norm exceeding $32\sqrt{2}\,\log(n)\,\tau^{-2}n^{-1/2}$ is exactly zero
for large enough n and that the probability of $\Gamma$ or $\Lambda$ not happening is
exponentially small. To ease notation, define $a = 32\sqrt{2}\,\log(n)\,\tau^{-2}n^{-1/2}$.

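The diagonal terms behave differently than the off diagonal terms. So,
break them apart:

\begin{align*}
P(\|LL - \mathscr{L}\mathscr{L}\|_F \ge a)
&\le P_{\Gamma\Lambda}(\|LL - \mathscr{L}\mathscr{L}\|_F \ge a) + P((\Gamma \cap \Lambda)^c) \\
&\le P_{\Gamma\Lambda}\Bigl(\sum_{i \ne j} [LL - \mathscr{L}\mathscr{L}]_{ij}^2 \ge a^2/2\Bigr)
   + P_{\Gamma\Lambda}\Bigl(\sum_{i} [LL - \mathscr{L}\mathscr{L}]_{ii}^2 \ge a^2/2\Bigr) \\
&\quad + P((\Gamma \cap \Lambda)^c).
\end{align*}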

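First, address the sum over the off diagonal terms:

\begin{align*}
P_{\Gamma\Lambda}\Bigl(\sum_{i \ne j} [LL - \mathscr{L}\mathscr{L}]_{ij}^2 \ge a^2/2\Bigr)
&\le P_{\Gamma\Lambda}\Bigl(\bigcup_{i \ne j}\Bigl\{[LL - \mathscr{L}\mathscr{L}]_{ij}^2 \ge \frac{a^2}{2n^2}\Bigr\}\Bigr) \tag{A.1} \\
&\le \sum_{i \ne j} P_{\Gamma\Lambda}\Bigl(|LL - \mathscr{L}\mathscr{L}|_{ij} \ge \frac{a}{\sqrt{2}\,n}\Bigr) \\
&\le \sum_{i \ne j} P_{\Gamma\Lambda}\Bigl(|LL - \tilde{L}\tilde{L}|_{ij} + |\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ij} \ge \frac{a}{\sqrt{2}\,n}\Bigr) \\
&\le \sum_{i \ne j} \Bigl[P_{\Gamma\Lambda}\Bigl(|LL - \tilde{L}\tilde{L}|_{ij} \ge \frac{a}{\sqrt{8}\,n}\Bigr)
   + P_{\Gamma\Lambda}\Bigl(|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ij} \ge \frac{a}{\sqrt{8}\,n}\Bigr)\Bigr]. \tag{A.2}
\end{align*}

The sum over the diagonal terms is similar,

\[
P_{\Gamma\Lambda}\Bigl(\sum_{i} [LL - \mathscr{L}\mathscr{L}]_{ii}^2 \ge a^2/2\Bigr)
\le \sum_{i} \Bigl[P_{\Gamma\Lambda}\Bigl(|LL - \tilde{L}\tilde{L}|_{ii} \ge \frac{a}{\sqrt{8n}}\Bigr)
   + P_{\Gamma\Lambda}\Bigl(|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ii} \ge \frac{a}{\sqrt{8n}}\Bigr)\Bigr],
\]

with one key difference. In (A.1), the union bound addresses nearly $n^2$ terms.
This yields the $1/n^2$ term in line (A.1). After taking the square root, each
term has a lower bound with a factor of $1/n$. However, because there are
only $n$ terms on the diagonal, after taking the square root in the last equation
above, the lower bound has a factor of $1/\sqrt{n}$.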

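To constrain the terms $|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ij}$ for $i = j$ and $i \ne j$, define

\[
\Lambda = \bigcap_{i,j}\Bigl\{\Bigl|\sum_k (W_{ik}W_{jk} - p_{ijk})/c_k\Bigr| < n^{1/2}\log n\Bigr\},
\qquad
p_{ijk} = \begin{cases} p_{ik}p_{jk}, & \text{if } i \ne j, \\ p_{ik}, & \text{if } i = j, \end{cases}
\]

for $p_{ij} = \mathscr{W}_{ij}$. We now show that for large enough $n$ and any $i \ne j$,

\begin{align*}
P_{\Gamma\Lambda}\Bigl(|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ij} \ge \frac{a}{\sqrt{8}\,n}\Bigr) &= 0, \tag{A.3} \\
P_{\Gamma\Lambda}\Bigl(|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ii} \ge \frac{a}{\sqrt{8n}}\Bigr) &= 0. \tag{A.4}
\end{align*}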

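To see (A.3), expand the left-hand side of the inequality for $i \ne j$,

\begin{align*}
|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ij}
&= \frac{1}{(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}}\Bigl|\sum_k (W_{ik}W_{jk} - p_{ik}p_{jk})/\mathscr{D}_{kk}\Bigr| \\
&= \frac{1}{n^2 (c_i c_j)^{1/2}}\Bigl|\sum_k (W_{ik}W_{jk} - p_{ik}p_{jk})/c_k\Bigr|.
\end{align*}

This is bounded on $\Lambda$, yielding

\[
|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ij} < \frac{\log n}{\tau n^{3/2}} \le \frac{\log n}{\tau^2 n^{3/2}} \le \frac{a}{\sqrt{8}\,n}.
\]

So, (A.3) holds for $i \ne j$. Equation (A.4) is different because $W_{ik}^2 = W_{ik}$. As
a result, the diagonal of $\tilde{L}\tilde{L}$ is a biased estimator of the diagonal of $\mathscr{L}\mathscr{L}$.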



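\begin{align*}
|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ii}
&= \Bigl|\sum_k \frac{W_{ik}^2}{\mathscr{D}_{ii}\mathscr{D}_{kk}} - \sum_k \frac{p_{ik}^2}{\mathscr{D}_{ii}\mathscr{D}_{kk}}\Bigr| \\
&\le \Bigl|\sum_k \frac{W_{ik} - p_{ik}}{\mathscr{D}_{ii}\mathscr{D}_{kk}}\Bigr|
   + \Bigl|\sum_k \frac{p_{ik} - p_{ik}^2}{\mathscr{D}_{ii}\mathscr{D}_{kk}}\Bigr| \tag{A.5} \\
&= \frac{1}{c_i n^2}\Bigl(\Bigl|\sum_k (W_{ik} - p_{ik})/c_k\Bigr| + \Bigl|\sum_k (p_{ik} - p_{ik}^2)/c_k\Bigr|\Bigr).
\end{align*}

Similar to the $i \ne j$ case, the first term is bounded by $\log(n)\,\tau^{-2}n^{-3/2}$ on $\Lambda$.
The second term is bounded by $\tau^{-2}n^{-1}$:

\[
\frac{1}{c_i n^2}\Bigl|\sum_k (p_{ik} - p_{ik}^2)/c_k\Bigr| \le \frac{1}{c_i n^2}\sum_k \frac{1}{c_k} \le \frac{1}{\tau^2 n}.
\]

Substituting the value of $a$ in reveals that on the set $\Lambda$, both terms in (A.5)
are bounded by $a(2\sqrt{8n})^{-1}$. So, their sum is bounded by $a(\sqrt{8n})^{-1}$,
satisfying (A.4).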




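This next part addresses the difference between $LL$ and $\tilde{L}\tilde{L}$, showing that
for large enough $n$, any $i \ne j$, and some set $\Gamma$,

\begin{align*}
P_{\Gamma\Lambda}\Bigl(|LL - \tilde{L}\tilde{L}|_{ij} \ge \frac{a}{\sqrt{8}\,n}\Bigr) &= 0, \\
P_{\Gamma\Lambda}\Bigl(|LL - \tilde{L}\tilde{L}|_{ii} \ge \frac{a}{\sqrt{8n}}\Bigr) &= 0.
\end{align*}

It is enough to show that for any $i$ and $j$,

\[
P_{\Gamma\Lambda}\Bigl(|LL - \tilde{L}\tilde{L}|_{ij} \ge \frac{a}{\sqrt{8}\,n}\Bigr) = 0. \tag{A.6}
\]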

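For $b(n) = \log(n)\,n^{-1/2}$, define $u(n) = 1 + b(n)$, $l(n) = 1 - b(n)$. With these
define the following sets:

\begin{align*}
\Gamma &= \bigcap_{i}\bigl\{D_{ii} \in \mathscr{D}_{ii}[l(n), u(n)]\bigr\}, \\
\Gamma(1) &= \bigcap_{i}\Bigl\{\frac{1}{D_{ii}} \in \frac{1}{\mathscr{D}_{ii}}[u(n)^{-1}, l(n)^{-1}]\Bigr\}, \\
\Gamma(2) &= \bigcap_{i,j}\Bigl\{\frac{1}{(D_{ii}D_{jj})^{1/2}} \in \frac{1}{(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}}[u(n)^{-1}, l(n)^{-1}]\Bigr\}, \\
\Gamma(3) &= \bigcap_{i,j,k}\Bigl\{\frac{1}{D_{kk}(D_{ii}D_{jj})^{1/2}} \in \frac{1}{\mathscr{D}_{kk}(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}}[u(n)^{-2}, l(n)^{-2}]\Bigr\}.
\end{align*}

Notice that $\Gamma \subseteq \Gamma(1) \subseteq \Gamma(2)$ and $\Gamma \subseteq \Gamma(3)$. Define another set:

\[
\Gamma(4) = \bigcap_{i,j,k}\Bigl\{\frac{1}{D_{kk}(D_{ii}D_{jj})^{1/2}} \in \frac{1}{\mathscr{D}_{kk}(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}}[1 - 16b(n), 1 + 16b(n)]\Bigr\}.
\]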

The next steps show that this set contains $\Gamma$. It is sufficient to show $\Gamma(3) \subset \Gamma(4)$.
This is true because

\begin{align*}
\frac{1}{u(n)^2} = \frac{1}{(1 + b(n))^2} = \frac{b(n)^{-2}}{(b(n)^{-1} + 1)^2}
&> \frac{b(n)^{-2} - 1}{(b(n)^{-1} + 1)^2} = \frac{b(n)^{-1} - 1}{b(n)^{-1} + 1} \\
&= 1 - \frac{2}{b(n)^{-1} + 1} > 1 - 16b(n).
\end{align*}

The 16 in the last bound is larger than it needs to be so that the upper and
lower bounds in $\Gamma(4)$ are symmetric. For the other direction,

\begin{align*}
\frac{1}{l(n)^2} = \frac{1}{(1 - b(n))^2} = \frac{b(n)^{-2}}{(b(n)^{-1} - 1)^2}
&= \Bigl(1 + \frac{1}{b(n)^{-1} - 1}\Bigr)^2 \\
&= 1 + \frac{2}{b(n)^{-1} - 1} + \frac{1}{(b(n)^{-1} - 1)^2}.
\end{align*}

We now need to bound the last two elements here. We are assuming
$n^{1/2}/\log n > 2$. Equivalently, $1 - b(n) > 1/2$. So, we have both of the following:

\[
\frac{1}{(b(n)^{-1} - 1)^2} < \frac{2}{b(n)^{-1} - 1}
\quad\text{and}\quad
\frac{2}{b(n)^{-1} - 1} = \frac{2b(n)}{1 - b(n)} < 4b(n).
\]

Putting these together,

\[
\frac{1}{l(n)^2} < 1 + \frac{4}{b(n)^{-1} - 1} = 1 + \frac{4b(n)}{1 - b(n)} < 1 + 16b(n).
\]

This shows that $\Gamma \subset \Gamma(4)$. Now, under the set $\Gamma$, and thus $\Gamma(4)$,



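\begin{align*}
|LL - \tilde{L}\tilde{L}|_{ij}
&= \Bigl|\sum_k W_{ik}W_{jk}\Bigl(\frac{1}{D_{kk}(D_{ii}D_{jj})^{1/2}} - \frac{1}{\mathscr{D}_{kk}(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}}\Bigr)\Bigr| \\
&\le \sum_k \Bigl|\frac{1}{D_{kk}(D_{ii}D_{jj})^{1/2}} - \frac{1}{\mathscr{D}_{kk}(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}}\Bigr| \\
&\le \sum_k \frac{16b(n)}{\mathscr{D}_{kk}(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}}
 \le \sum_k \frac{16b(n)}{\tau^2 n^2} = \frac{16b(n)}{\tau^2 n}.
\end{align*}

This is equal to $a(\sqrt{8}\,n)^{-1}$, showing (A.6) holds for all $i$ and $j$.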
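
The two elementary bounds behind the inclusion of $\Gamma(3)$ in $\Gamma(4)$, namely $1/u(n)^2 \ge 1 - 16b(n)$ and $1/l(n)^2 \le 1 + 16b(n)$ whenever $n^{1/2}/\log n > 2$, can be checked numerically. The sketch below is only an illustration on a finite grid of $n$ values (the grid and the helper name `b` are assumptions); it does not replace the algebra above.

```python
import numpy as np

def b(n):
    """b(n) = log(n) / sqrt(n), as defined in the proof."""
    return np.log(n) / np.sqrt(n)

# Grid of n values; keep those satisfying the lemma's condition sqrt(n)/log(n) > 2.
n = np.arange(2, 200_001, dtype=float)
n = n[np.sqrt(n) / np.log(n) > 2]
bn = b(n)

# Lower and upper bounds that place Gamma(3) inside Gamma(4).
lower_ok = bool(np.all(1.0 / (1.0 + bn) ** 2 >= 1.0 - 16.0 * bn))
upper_ok = bool(np.all(1.0 / (1.0 - bn) ** 2 <= 1.0 + 16.0 * bn))
```

On this grid both flags should be `True`, and every retained value of `bn` is below 1/2, which is the form of the assumption used in the proof.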

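The remaining step is to bound $P((\Gamma \cap \Lambda)^c)$. Using the union bound, this
is less than or equal to $P(\Gamma^c) + P(\Lambda^c)$:

\begin{align*}
P(\Gamma^c) &= P\Bigl(\bigcup_i \{D_{ii} \notin \mathscr{D}_{ii}[1 - b(n), 1 + b(n)]\}\Bigr) \\
&\le \sum_i P(\{D_{ii} \notin \mathscr{D}_{ii}[1 - b(n), 1 + b(n)]\}) \\
&\le \sum_i 2\exp\Bigl(-2\Bigl(\frac{\mathscr{D}_{ii}\log n}{n^{1/2}}\Bigr)^2 \frac{1}{n}\Bigr) \\
&\le 2n\exp(-2\tau^2(\log n)^2) \\
&= 2n^{1 - 2\tau^2\log n},
\end{align*}

where the second to last inequality is by Hoeffding's inequality. The next
inequality is Hoeffding's:

\begin{align*}
P(\Lambda^c) &= P\Bigl(\bigcup_{i,j}\Bigl\{\Bigl|\sum_k (W_{ik}W_{jk} - p_{ijk})/c_k\Bigr| \ge n^{1/2}\log n\Bigr\}\Bigr) \\
&\le \sum_{i,j} P\Bigl(\Bigl|\sum_k (W_{ik}W_{jk} - p_{ijk})/c_k\Bigr| \ge n^{1/2}\log n\Bigr) \\
&\le \sum_{i,j} 2\exp\Bigl(-2n(\log n)^2 \Big/ \sum_k 1/c_k^2\Bigr) \\
&\le \sum_{i,j} 2\exp(-2(\log n)^2\tau^2) \\
&\le 2n^2\exp(-2(\log n)^2\tau^2) \\
&= 2n^{2 - 2\tau^2\log n}.
\end{align*}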

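Because $W$ is symmetric, the independence of the $W_{ik}W_{jk}$ across $k$ is not
obvious. However, because $W_{ii} = W_{jj} = 0$, they are independent across $k$.
Putting the pieces together,

\begin{align*}
P(\|LL - \mathscr{L}\mathscr{L}\|_F \ge a)
&\le P_{\Gamma\Lambda}(\|LL - \mathscr{L}\mathscr{L}\|_F \ge a) + P((\Gamma \cap \Lambda)^c) \\
&\le 0 + 2n^{1 - 2\tau^2\log n} + 2n^{2 - 2\tau^2\log n} \\
&< 4n^{2 - 2\tau^2\log n}. \qquad \Box
\end{align*}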

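The following proves Theorem 2.1.

Proof of Theorem 2.1. Adding the $n$ super- and subscripts to Lemma A.1,
it states that if $n^{1/2}/\log n > 2$, then

\[
P\Bigl(\|L^{(n)}L^{(n)} - \mathscr{L}^{(n)}\mathscr{L}^{(n)}\|_F \ge \frac{c\,\log(n)}{\tau_n^2 n^{1/2}}\Bigr) < 4n^{2 - 2\tau_n^2\log n}
\]

for $c = 32\sqrt{2}$. By assumption, for all $n > N$, $\tau_n^2\log n > 2$. This implies that
$2 - 2\tau_n^2\log n < -2$ for all $n > N$. Rearranging and summing over $n$, for any
fixed $\varepsilon > 0$,

\begin{align*}
\sum_{n=1}^{\infty} P\Bigl(\frac{\|L^{(n)}L^{(n)} - \mathscr{L}^{(n)}\mathscr{L}^{(n)}\|_F}{c\,\tau_n^{-2}\log(n)\,n^{-1/2}/\varepsilon} \ge \varepsilon\Bigr)
&\le N + 4\sum_{n=N+1}^{\infty} n^{2 - 2\tau_n^2\log n} \\
&\le N + 4\sum_{n=N+1}^{\infty} n^{-2},
\end{align*}

which is a summable sequence. By the Borel-Cantelli theorem, the conclusion of Theorem 2.1 follows.

APPENDIX B: DAVIS-KAHAN THEOREM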

The statement of the theorem below and the preceding explanation come 
largely from von Luxburg (2007). For a more detailed account of the Davis- 
Kahan theorem, see Stewart and Sun (1990). 

To avoid the issues associated with multiple eigenvalues, this theorem's
original statement is instead about the subspace formed by the eigenvectors.
For a distance between subspaces, the theorem uses "canonical angles,"
which are also known as "principal angles." Given two matrices $M_1$ and $M_2$
both in $\mathbb{R}^{n \times p}$ with orthonormal columns, the singular values $(\sigma_1, \ldots, \sigma_p)$
of $M_1^T M_2$ are the cosines of the principal angles $(\cos\Theta_1, \ldots, \cos\Theta_p)$ between
the column space of $M_1$ and the column space of $M_2$. Define $\sin\Theta(M_1, M_2)$
to be a diagonal matrix containing the sines of the principal angles of $M_1^T M_2$
and define

\[
d(M_1, M_2) = \|\sin\Theta(M_1, M_2)\|_F, \tag{B.1}
\]

which can be expressed as $(p - \sum_{j=1}^{p}\sigma_j^2)^{1/2}$ by using the identity $\sin^2\theta = 1 - \cos^2\theta$.
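
To make definition (B.1) concrete, here is a minimal sketch that computes the principal angles between two column spaces from the singular values of $M_1^T M_2$ and evaluates $d(M_1, M_2)$ in the two equivalent ways mentioned above. The random matrices, the dimensions, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3

# Two hypothetical n x p matrices with orthonormal columns (via reduced QR).
M1, _ = np.linalg.qr(rng.standard_normal((n, p)))
M2, _ = np.linalg.qr(rng.standard_normal((n, p)))

# Singular values of M1^T M2 are the cosines of the principal angles.
sigma = np.linalg.svd(M1.T @ M2, compute_uv=False)
sigma = np.clip(sigma, 0.0, 1.0)          # guard against rounding slightly above 1
theta = np.arccos(sigma)

# d(M1, M2) = ||sin Theta||_F, computed two equivalent ways.
d_from_sines = np.sqrt(np.sum(np.sin(theta) ** 2))
d_from_sigma = np.sqrt(p - np.sum(sigma ** 2))
```

The two values agree up to rounding, and both lie between 0 (identical column spaces) and $\sqrt{p}$ (orthogonal column spaces).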

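Proposition B.1 (Davis-Kahan). Let $S \subset \mathbb{R}$ be an interval. Denote $\mathscr{X}$
as an orthonormal matrix whose column space is equal to the eigenspace
of $\mathscr{L}\mathscr{L}$ corresponding to the eigenvalues in $\lambda_S(\mathscr{L}\mathscr{L})$ [more formally, the
column space of $\mathscr{X}$ is the image of the spectral projection of $\mathscr{L}\mathscr{L}$ induced by
$\lambda_S(\mathscr{L}\mathscr{L})$]. Denote by $X$ the analogous quantity for $LL$. Define the distance
between $S$ and the spectrum of $\mathscr{L}\mathscr{L}$ outside of $S$ as

\[
\delta = \min\{|\lambda - s| ;\ \lambda \text{ eigenvalue of } \mathscr{L}\mathscr{L},\ \lambda \notin S,\ s \in S\}.
\]

Then the distance $d(\mathscr{X}, X) = \|\sin\Theta(\mathscr{X}, X)\|_F$ between the column spaces of $\mathscr{X}$
and $X$ is bounded by

\[
d(\mathscr{X}, X) \le \frac{\|LL - \mathscr{L}\mathscr{L}\|_F}{\delta}.
\]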
In the theorem, $\mathscr{L}\mathscr{L}$ and $LL$ can be replaced by any two symmetric matrices.
The rest of this section converts the bound on $d(\mathscr{X}, X)$ to a bound on
$\|X - \mathscr{X}O\|_F$, where $O$ is some orthonormal rotation. For this, we will make
an additional assumption that $\mathscr{X}$ and $X$ have the same dimension. Assume
there exists $S \subset \mathbb{R}$ containing $k$ eigenvalues of $\mathscr{L}\mathscr{L}$ and $k$ eigenvalues of
$LL$, but containing no other eigenvalues of either matrix. Because $LL$ and
$\mathscr{L}\mathscr{L}$ are symmetric, their eigenvectors can be defined to be orthonormal. Let
the columns of $\mathscr{X} \in \mathbb{R}^{n \times k}$ be $k$ orthonormal eigenvectors of $\mathscr{L}\mathscr{L}$ corresponding
to the $k$ eigenvalues contained in $S$. Let the columns of $X \in \mathbb{R}^{n \times k}$
be $k$ orthonormal eigenvectors of $LL$ corresponding to the $k$ eigenvalues contained
in $S$. By singular value decomposition, there exist orthonormal matrices
$U, V$ and diagonal matrix $\Sigma$ such that $\mathscr{X}^T X = U\Sigma V^T$. The singular
values, $\sigma_1, \ldots, \sigma_k$, down the diagonal of $\Sigma$ are the cosines of the principal
angles between the column space of $X$ and the column space of $\mathscr{X}$.

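Although the Davis-Kahan theorem is a statement regarding the principal
angles, a few lines of algebra show that it can be extended to a bound on the
Frobenius norm between the matrix $X$ and $\mathscr{X}UV^T$, where the matrix $UV^T$
is an orthonormal rotation:

\begin{align*}
\frac{1}{2}\|X - \mathscr{X}UV^T\|_F^2 &= \frac{1}{2}\operatorname{trace}\bigl((X - \mathscr{X}UV^T)^T(X - \mathscr{X}UV^T)\bigr) \\
&= \frac{1}{2}\operatorname{trace}\bigl(VU^T\mathscr{X}^T\mathscr{X}UV^T + X^TX - 2VU^T\mathscr{X}^TX\bigr) \\
&= \frac{1}{2}\bigl(k + k - 2\operatorname{trace}(VU^T\mathscr{X}^TX)\bigr) \\
&\le \frac{\|LL - \mathscr{L}\mathscr{L}\|_F^2}{\delta^2},
\end{align*}

where the last inequality is explained below. It follows from a property of
the trace, the fact that the singular values are in $[0, 1]$, the trigonometric
identity $\cos^2\theta = 1 - \sin^2\theta$ and the Davis-Kahan theorem:

\begin{align*}
\operatorname{trace}(VU^T\mathscr{X}^TX) &= \sum_{i=1}^{k}\sigma_i \ge \sum_{i=1}^{k}(\cos\Theta_i)^2 = \sum_{i=1}^{k}\bigl(1 - (\sin\Theta_i)^2\bigr) \\
&= k - (d(X, \mathscr{X}))^2 \ge k - \frac{\|LL - \mathscr{L}\mathscr{L}\|_F^2}{\delta^2}.
\end{align*}

This shows that the Davis-Kahan theorem can be thought of as
bounding $\|\mathscr{X}UV^T - X\|_F^2$ instead of $d(\mathscr{X}, X)$. The matrix $O$ in Theorem 2.1
is equal to $UV^T$. In this way, it is dependent on $X$ and $\mathscr{X}$.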
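
The algebra in this section can be illustrated numerically. The sketch below uses a hypothetical symmetric matrix and a small symmetric perturbation of it in place of $\mathscr{L}\mathscr{L}$ and $LL$, forms the rotation $O = UV^T$ from the singular value decomposition of $\mathscr{X}^T X$, and checks that $\frac{1}{2}\|X - \mathscr{X}O\|_F^2 = k - \sum_i \sigma_i \le d(X, \mathscr{X})^2$. It checks only this algebraic step, not the Davis-Kahan bound itself; all names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 10, 3

def top_k_eigvecs(S, k):
    """Orthonormal eigenvectors for the k largest eigenvalues of symmetric S."""
    _, vecs = np.linalg.eigh(S)      # eigenvalues are returned in ascending order
    return vecs[:, -k:]

# Hypothetical symmetric "population" matrix and a small symmetric perturbation.
A = rng.standard_normal((n, n))
S_pop = (A + A.T) / 2
E = rng.standard_normal((n, n))
S_emp = S_pop + 0.01 * (E + E.T) / 2

X_pop = top_k_eigvecs(S_pop, k)      # plays the role of the population eigenvectors
X_emp = top_k_eigvecs(S_emp, k)      # plays the role of the empirical eigenvectors

# SVD of X_pop^T X_emp = U Sigma V^T; the rotation is O = U V^T.
U, sigma, Vt = np.linalg.svd(X_pop.T @ X_emp)
O = U @ Vt

half_sq_norm = 0.5 * np.linalg.norm(X_emp - X_pop @ O, "fro") ** 2
d_squared = k - np.sum(sigma ** 2)   # squared sin-Theta distance, as in (B.1)
```

Here `half_sq_norm` equals `k - sigma.sum()` up to rounding and does not exceed `d_squared`, matching the chain of inequalities above.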

APPENDIX C: PROOF OF THEOREM 2.2 

By Lemma 2.1, the column vectors of $X_n$ are eigenvectors of $L^{(n)}L^{(n)}$
corresponding to all the eigenvalues in $\lambda_{S_n}(L^{(n)}L^{(n)})$. For the application
of the Davis-Kahan theorem, this means that the column space of $X_n$ is
the image of the spectral projection of $L^{(n)}L^{(n)}$ induced by $\lambda_{S_n}(L^{(n)}L^{(n)})$,
similarly for the column vectors of $\mathscr{X}_n$, the matrix $\mathscr{L}^{(n)}\mathscr{L}^{(n)}$ and the set
$\lambda_{S_n}(\mathscr{L}^{(n)}\mathscr{L}^{(n)})$.
Recall that $\bar\lambda_1^{(n)} \ge \cdots \ge \bar\lambda_n^{(n)}$ are defined to be the eigenvalues of symmetric
matrix $\mathscr{L}^{(n)}\mathscr{L}^{(n)}$ and $\lambda_1^{(n)} \ge \cdots \ge \lambda_n^{(n)}$ are defined to be the eigenvalues of
symmetric matrix $L^{(n)}L^{(n)}$. From (2.2),

\[
\max_i |\lambda_i^{(n)} - \bar\lambda_i^{(n)}| = o\Bigl(\frac{\log n}{\tau_n^2 n^{1/2}}\Bigr).
\]

By assumption, $\tau_n^2 > 2/\log n$. So,

\[
\frac{\log n}{\tau_n^2 n^{1/2}} < \frac{(\log n)^2}{n^{1/2}} = O(\min\{\delta_n, \delta_n'\}),
\]

where the last step follows by assumption. Thus,

\[
\max_i |\lambda_i^{(n)} - \bar\lambda_i^{(n)}| = o(\min\{\delta_n, \delta_n'\}).
\]

This means that, eventually, $\lambda_i^{(n)} \in S_n$ if and only if $\bar\lambda_i^{(n)} \in S_n$. Thus, the
number of elements in $\lambda_{S_n}(\mathscr{L}^{(n)}\mathscr{L}^{(n)})$ is eventually equal to the number of
elements in $\lambda_{S_n}(L^{(n)}L^{(n)})$, implying that $X_n$ and $\mathscr{X}_n$ will eventually have
the same number of columns, $k_n = K_n$.

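Once $X_n$ and $\mathscr{X}_n$ have the same number of columns, define matrices $U_n$
and $V_n$ with singular value decomposition: $\mathscr{X}_n^T X_n = U_n\Sigma_nV_n^T$. Define $O_n = U_nV_n^T$.
The result follows from the Davis-Kahan theorem and Theorem 2.1:

\[
\|X_n - \mathscr{X}_nO_n\|_F \le \frac{\sqrt{2}\,\|L^{(n)}L^{(n)} - \mathscr{L}^{(n)}\mathscr{L}^{(n)}\|_F}{\delta_n} = o\Bigl(\frac{\log n}{\delta_n\tau_n^2 n^{1/2}}\Bigr) \quad \text{a.s.}
\]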

APPENDIX D: STOCHASTIC BLOCKMODEL 

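Proof of Lemma 3.1. First, construct the matrix $B_L \in \mathbb{R}^{k \times k}$ such
that $\mathscr{L} = ZB_LZ^T$. Define $D_B = \operatorname{diag}(BZ^T\mathbf{1}_n) \in \mathbb{R}^{k \times k}$ where $\mathbf{1}_n$ is a vector
of ones in $\mathbb{R}^n$. For any $i, j$,

\[
\mathscr{L}_{ij} = \frac{\mathscr{W}_{ij}}{(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}} = z_i D_B^{-1/2} B D_B^{-1/2} z_j^T.
\]

Define $B_L = D_B^{-1/2}BD_B^{-1/2}$. It follows that $\mathscr{L}_{ij} = (ZB_LZ^T)_{ij}$ and thus
$\mathscr{L} = ZB_LZ^T$.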

Because $B$ is symmetric, so are $B_L$ and $(Z^TZ)^{1/2}B_L(Z^TZ)^{1/2}$. Notice
that

\[
\det\bigl((Z^TZ)^{1/2}B_L(Z^TZ)^{1/2}\bigr) = \det((Z^TZ)^{1/2})\det(B_L)\det((Z^TZ)^{1/2}) > 0.
\]

By eigenvector decomposition, define $V \in \mathbb{R}^{k \times k}$ and diagonal matrix
$\Lambda \in \mathbb{R}^{k \times k}$ such that

\[
(Z^TZ)^{1/2}B_L(Z^TZ)^{1/2} = V\Lambda V^T. \tag{D.1}
\]

Because the determinant of the left-hand side of (D.1) is greater than zero,
none of the eigenvalues in $\Lambda$ are equal to zero. Left multiply (D.1) by
$Z(Z^TZ)^{-1/2}$ and right multiply by $(Z^TZ)^{-1/2}Z^T$. This shows

\[
ZB_LZ^T = Z\mu\Lambda(Z\mu)^T, \tag{D.2}
\]

where $\mu = (Z^TZ)^{-1/2}V$. Notice that $(Z\mu)^T(Z\mu) = I_k$, the $k \times k$ identity
matrix. So, right multiplying (D.2) by $Z\mu$ shows that the columns of $Z\mu$ are
eigenvectors of $ZB_LZ^T = \mathscr{L}$ with the eigenvalues down the diagonal of $\Lambda$.
Equation (D.2) shows that these are the only nonzero eigenvalues.

It remains to prove equivalence statement (3.2). Notice

\[
\det(\mu) = \det((Z^TZ)^{-1/2})\det(V) > 0.
\]

So, $\mu^{-1}$ exists and statement (3.2) follows. $\Box$
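
The construction in this proof can be reproduced numerically on a small example. The sketch below assumes a hypothetical blockmodel with three blocks and an arbitrary full-rank matrix $B$ (neither taken from the paper), builds $\mathscr{L}$, $B_L$ and $\mu$ as in the proof, and checks that the columns of $Z\mu$ are orthonormal eigenvectors of $\mathscr{L}$.

```python
import numpy as np

# Hypothetical blockmodel: k = 3 blocks with sizes 4, 3, 5 (n = 12).
sizes = [4, 3, 5]
k, n = len(sizes), sum(sizes)
labels = np.repeat(np.arange(k), sizes)
Z = np.eye(k)[labels]                        # n x k membership matrix

# Hypothetical symmetric, full-rank block probability matrix B.
B = np.array([[0.6, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.7]])

W_pop = Z @ B @ Z.T                          # population adjacency matrix
deg = W_pop.sum(axis=1)                      # diagonal of the population degree matrix
L_pop = W_pop / np.sqrt(np.outer(deg, deg))  # population normalized Laplacian

# B_L = D_B^{-1/2} B D_B^{-1/2} with D_B = diag(B Z^T 1_n).
d_B = B @ Z.T @ np.ones(n)
B_L = B / np.sqrt(np.outer(d_B, d_B))

# Eigendecomposition (D.1) and mu = (Z^T Z)^{-1/2} V; here Z^T Z = diag(sizes).
root = np.diag(np.sqrt(sizes))
lam, V = np.linalg.eigh(root @ B_L @ root)
mu = np.diag(1 / np.sqrt(sizes)) @ V
Zmu = Z @ mu

eig_residual = np.abs(L_pop @ Zmu - Zmu * lam).max()
orth_residual = np.abs(Zmu.T @ Zmu - np.eye(k)).max()
n_unique_rows = np.unique(Zmu, axis=0).shape[0]
```

Both residuals should vanish up to rounding, and `Zmu` has exactly `k` distinct rows, one per block, which is the structure spectral clustering exploits.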

The following is a proof of Lemma 3.2.

Proof of Lemma 3.2. The following statement is the essential ingredient
to prove Lemma 3.2:

\[
\text{if } z_i \ne z_j, \text{ then } \|z_i\mu - z_j\mu\|_2 \ge \sqrt{2/P}. \tag{D.3}
\]

The proof of statement (D.3) requires the following definition:

\[
\|\mu\|_m^2 = \min_{x :\, \|x\|_2 = 1} \|x\mu\|_2^2.
\]

Notice that

\[
\|\mu\|_m^2 = \min_{x :\, \|x\|_2 = 1} x\mu\mu^Tx^T = \min_{x :\, \|x\|_2 = 1} x(Z^TZ)^{-1}x^T = 1/P.
\]

So,

\[
\|z_i\mu - z_j\mu\|_2 = \|(z_i - z_j)\mu\|_2 \ge \sqrt{2}\,\|\mu\|_m = \sqrt{2/P},
\]

proving statement (D.3). The proof of Lemma 3.2 follows:

\[
\|c_iO - z_j\mu\|_2 \ge \|z_i\mu - z_j\mu\|_2 - \|c_iO - z_i\mu\|_2 \ge \sqrt{\frac{2}{P}} - \frac{1}{\sqrt{2P}} = \frac{1}{\sqrt{2P}}. \qquad \Box
\]

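Proof of Theorem 3.1. Define $X \in \mathbb{R}^{n \times k}$ to contain the eigenvectors
of $L$ corresponding to the largest $k$ eigenvalues and define

\[
C = \operatorname*{argmin}_{M \in \mathscr{R}(n,k)} \|X - M\|_F^2,
\]

where $\mathscr{R}(n, k)$ is defined as

\[
\mathscr{R}(n, k) = \{M \in \mathbb{R}^{n \times k} : M \text{ has no more than } k \text{ unique rows}\}.
\]

Notice that

\[
\min_{M \in \mathscr{R}(n,k)} \|X - M\|_F^2 = \min_{\{m_1, \ldots, m_k\} \subset \mathbb{R}^k} \sum_i \min_g \|x_i - m_g\|_2^2.
\]

This shows that the $i$th row of $C$ is equal to $c_i$ as defined in Definition 3.
Because $Z\mu O \in \mathscr{R}(n, k)$, notice that

\[
\|X - C\|_F \le \|X - Z\mu O\|_F. \tag{D.4}
\]

By the triangle inequality and inequality (D.4),

\[
\|C - Z\mu O\|_F \le \|C - X\|_F + \|X - Z\mu O\|_F \le 2\|X - Z\mu O\|_F.
\]

In the notation of Theorem 2.2, define $S_n = [\lambda_{k_n}^2/2, 2]$. Then, $\delta = \delta' = \lambda_{k_n}^2/2$.
By assumption, $n^{-1/2}(\log n)^2 = O(\lambda_{k_n}^2) = O(\min\{\delta, \delta'\})$. This implies
that the results from Theorem 2.2 hold. Putting the pieces together,

\begin{align*}
|\mathscr{M}| &\le \sum_{i \in \mathscr{M}} 1 \le 2P_n\sum_{i \in \mathscr{M}} \|c_i - z_i\mu O\|_2^2 \\
&\le 2P_n\|C - Z\mu O\|_F^2 \\
&\le 8P_n\|X - Z\mu O\|_F^2 \\
&= o\Bigl(\frac{P_n(\log n)^2}{\lambda_{k_n}^4\tau_n^4 n}\Bigr) \quad \text{a.s.} \qquad \Box
\end{align*}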



In the second example of Section 3, it was claimed that

\[
\lambda_k = \frac{1}{k(r/p) + 1}.
\]

The following is a proof of that statement.
Define $B \in \mathbb{R}^{k \times k}$ such that

\[
B = pI_k + r\mathbf{1}_k\mathbf{1}_k^T,
\]

where $I_k \in \mathbb{R}^{k \times k}$ is the identity matrix, $\mathbf{1}_k \in \mathbb{R}^k$ is a vector of ones, $r \in (0, 1)$
and $p \in (0, 1 - r)$. Assume that $p$ and $r$ are fixed and $k$ can grow with $n$.
Let $Z \in \{0, 1\}^{n \times k}$ be such that $Z^T\mathbf{1}_n = s\mathbf{1}_k$. This guarantees that all $k$
groups have equal size $s$. The Stochastic Blockmodel in the second example
of Section 3 has the population adjacency matrix $\mathscr{W} = ZBZ^T$.
Define

\[
B_L = \frac{1}{nr + sp}(pI_k + r\mathbf{1}_k\mathbf{1}_k^T).
\]

From the argument in the proof of Lemma 3.1, $\mathscr{L}$ has the same nonzero
eigenvalues as $(Z^TZ)^{1/2}B_L(Z^TZ)^{1/2} \in \mathbb{R}^{k \times k}$. Let $\lambda_1, \ldots, \lambda_k$ be the eigenvalues
of $(Z^TZ)^{1/2}B_L(Z^TZ)^{1/2} = (s^{1/2}I_k)B_L(s^{1/2}I_k) = sB_L$. Notice that $\mathbf{1}_k$ is
an eigenvector with eigenvalue 1:

\[
sB_L\mathbf{1}_k = \frac{s}{nr + sp}(pI_k + r\mathbf{1}_k\mathbf{1}_k^T)\mathbf{1}_k = \frac{s(p + kr)}{nr + sp}\mathbf{1}_k = \mathbf{1}_k.
\]

Let $\lambda_1 = 1$. Define

\[
U = \{u : \|u\|_2 = 1,\ u^T\mathbf{1}_k = 0\}.
\]

Notice that for all $u \in U$,

\[
sB_Lu = \frac{s}{nr + sp}(pI_k + r\mathbf{1}_k\mathbf{1}_k^T)u = \frac{sp}{nr + sp}u. \tag{D.5}
\]

Equation (D.5) implies that for $i > 1$,

\[
\lambda_i = \frac{sp}{nr + sp}.
\]

This is also true for $i = k$:

\[
\lambda_k = \frac{sp}{nr + sp} = \frac{1}{k(r/p) + 1}.
\]

This is the smallest nonzero eigenvalue of $\mathscr{L}$.
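
The closed form for $\lambda_k$ can be verified on a small instance. The following sketch assumes illustrative values $k = 5$, $s = 4$, $p = 0.3$ and $r = 0.1$ (chosen only so that $r \in (0, 1)$ and $p \in (0, 1 - r)$), builds the population Laplacian directly, and compares its nonzero eigenvalues with $1$ and $1/(k(r/p) + 1)$.

```python
import numpy as np

# Hypothetical parameters: k blocks of equal size s, so n = k * s.
k, s = 5, 4
n = k * s
p, r = 0.3, 0.1                              # r in (0, 1), p in (0, 1 - r)

B = p * np.eye(k) + r * np.ones((k, k))
Z = np.kron(np.eye(k), np.ones((s, 1)))      # n x k, each block has s nodes

W_pop = Z @ B @ Z.T                          # population adjacency matrix
deg = W_pop.sum(axis=1)                      # every entry equals n*r + s*p
L_pop = W_pop / np.sqrt(np.outer(deg, deg))  # population normalized Laplacian

eigvals = np.sort(np.linalg.eigvalsh(L_pop))[::-1]
nonzero = eigvals[:k]                        # the k nonzero eigenvalues
claimed = 1.0 / (k * (r / p) + 1.0)          # closed form for lambda_k
```

With these values the largest eigenvalue is 1, the remaining `k - 1` nonzero eigenvalues all equal `claimed` (0.375 here), and the other `n - k` eigenvalues are zero up to rounding.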